1. Introduction


This project investigated the requirements for supporting distributed scientific
computation over a cluster of cooperating workstations.
Various experiments on computations for the solution of simultaneous linear equations were
performed in the early phase of the project to gain experience in the general nature and
requirements of scientific applications. A specification of a distributed integrated computing 
environment, DICE, based on a distributed shared memory communication paradigm has 
been developed and evaluated. The distributed shared memory model facilitates porting 
existing parallel algorithms that have been designed for shared memory multiprocessor 
systems to the new environment. The potential of this new environment is to provide 
supercomputing capability through the utilization of the aggregate power of workstations 
cooperating in a cluster interconnected via a local area network. 



The great majority of scientific applications require a fairly large amount of memory to 
execute a task. If a task is to be partitioned into threads (sub-tasks) that are executed in
parallel, memory sharing is very desirable since it allows sharing variables among threads
within the same task. Shared memory multiprocessor systems have been the predominant 
platform selected for executing large scientific applications for these reasons. 

Workstations, generally, do not have the computing power to tackle complex scientific 
applications, making them primarily useful for visualization, data reduction, and filtering as 
far as complex scientific applications are concerned. There is a tremendous amount of 
computing power that is left unused in a network of workstations. Very often a workstation is 
simply sitting idle on a desk. A set of tools can be developed to take advantage of this 
potential computing power to create a platform suitable for large scientific computations. 
The integration of several workstations into a logical cluster of distributed, cooperative, 
computing stations presents an alternative to shared memory multiprocessor systems. In this 
project we designed and evaluated such a system. 

Attached to this report are three papers published or accepted for publication, resulting from 
this research project. These articles are: 

1. Hasan S. AlKhatib, Qiang Li, Chi-Jiunn Jou, Tiekun Chen and Hassan Arafeh, "DICE
- a Distributed Integrated Computing Environment for Multi-threaded Parallel
Processing", Proceedings of the Third International Systems Integration Conference,
Sao Paulo, Brazil, August 15-19, 1994, pp. 612-621.

2. Chi-Jiunn Jou, Hasan S. AlKhatib, Qiang Li and Tiekun Chen, "Coherency Protocol and
Algorithm of the DICE Distributed Shared Memory", Proceedings of the ISCA
International Conference on Parallel and Distributed Computing Systems, Las Vegas,
NV, October 6-8, 1994, pp. 796-801.

3. Chi-Jiunn Jou, Hasan S. AlKhatib and Qiang Li, "Two-Tier Paging and Its Performance
Analysis for Network-based Distributed Shared Memory Systems", accepted for
publication in the IEICE Transactions on Information and Systems.

2. DICE Overview 

DICE is a computing environment for executing multi-threaded tasks on a cluster of 
networked workstations. In DICE, threads of a parallel task may run on separate 
workstations sharing the same virtual address space. Threads communicate with each other 
using shared memory. An overall system structure of DICE is shown in Figure 1. 

DICE consists of three interactive subsystems: a distributed shared memory (DSM), a parallel
scheduler (PS), and a distributed run-time subsystem (DRS). DSM provides mechanisms for
sharing distributed memory among threads of a parallel task and hence supports the
underlying computing and communication paradigm. PS provides tools to initiate both local
and remote threads and to coordinate their execution over different workstations. DRS
provides the programmer's interface for developing parallel tasks as well as the run-time
environment for their execution.

Figure 1. System structure for DICE

3. Distributed Shared Memory 

In DICE, the physical memories of individual workstations in a cluster are treated as
resources for the virtual space of a multi-threaded parallel task. Pages of the address space of 
a task can be shared among the threads of the same task. A task consists of multiple threads 
that can run on different workstations in a cluster simultaneously. The virtual memory of 
DICE is divided into private and shared spaces. Private space is local to a single workstation, 
and is not shared among threads. An example of private space is the stack of a thread. Shared 
space is global to all workstations, and is shared among all threads of a parallel task. Shared 
space is further divided into read-only code and read-write data spaces. The initial 
implementation of DICE will only support the shared data space. 

DICE presents a new distributed shared memory design to attack the problems of false
sharing and thrashing. False sharing may occur in a typical distributed shared memory system
such as Ivy[1], since its consistency or access unit (e.g., a word) is smaller than its sharing unit
(a page). The single-write nature of its coherency protocol may cause a "ping-pong"
behavior between multiple writers of a shared page, which is the thrashing problem. To overcome
these problems, Mach[2] uses a shared memory server to perform the fault scheduling via a
queueing mechanism[3]. Mether[4,5] avoids these problems through the use of
inconsistent memory. Clouds[6] avoids these problems by using single-writer-single-reader strict
coherence semantics. Mirage[7] reduces the effect of these problems by using a time window
scheme, in which the system guarantees that the writer of a page retains access to the page for a
fixed period of time. Munin[8] minimizes these problems by using multiple type-specific
coherency protocols.

To overcome these false sharing and thrashing problems, DICE DSM uses a hybrid memory 
granularity and supports multiple coherency protocols. Shared memory is structured as a 
two-layer paging system. The higher layer is a page, which is the same as the one in an existing 
system. The lower layer is a paragraph, which is a small fixed-sized memory region within a 
page. The memory sharing unit is a page, while the coherency unit is a paragraph. Each page 
in the shared address space is divided into several small equal-sized paragraphs. Each 
paragraph uses one and only one specific protocol at a time. The protocol used on a 
paragraph can be changed to adapt to new application requirements. The default protocol
used on a paragraph is that of inconsistent memory, which only provides memory sharing 
without coherency. Other coherency protocols include write-invalidate, write-update, 
write-read-migrate, home-read-write, release-update, and entry-invalidate. 
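As a rough illustration of the two-tier layout described above, the following sketch splits a shared virtual address into a page number and a paragraph number, and records one protocol per paragraph with inconsistent memory as the default. The page size (4096 bytes), the paragraph size (64 bytes), and all function and class names are assumptions made for this example; the report does not specify them.

```python
from enum import Enum

PAGE_SIZE = 4096       # assumed page size in bytes (not given in the report)
PARAGRAPH_SIZE = 64    # assumed paragraph size in bytes (not given in the report)
PARAGRAPHS_PER_PAGE = PAGE_SIZE // PARAGRAPH_SIZE


class Protocol(Enum):
    INCONSISTENT = "inconsistent"          # default: sharing without coherency
    WRITE_INVALIDATE = "write-invalidate"
    WRITE_UPDATE = "write-update"
    WRITE_READ_MIGRATE = "write-read-migrate"
    HOME_READ_WRITE = "home-read-write"
    RELEASE_UPDATE = "release-update"
    ENTRY_INVALIDATE = "entry-invalidate"


def locate(address):
    """Return (page number, paragraph number within that page) for an address."""
    page, offset = divmod(address, PAGE_SIZE)
    return page, offset // PARAGRAPH_SIZE


class ProtocolTable:
    """Hypothetical table holding exactly one protocol per paragraph at a time."""

    def __init__(self):
        self._table = {}

    def get(self, page, paragraph):
        # Paragraphs that were never configured use the default protocol.
        return self._table.get((page, paragraph), Protocol.INCONSISTENT)

    def set(self, page, paragraph, protocol):
        # Changing the protocol replaces the previous one for this paragraph.
        self._table[(page, paragraph)] = protocol
```

Under these assumed sizes, two variables that live in the same page but in different paragraphs map to different coherency units, which is how the design limits false sharing.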

Write-invalidate, write-update, write-read-migrate, and home-read-write protocols
provide strict consistency on copies of a shared paragraph. They resemble the
read-replication, full-replication, migration, and central algorithms in [9], respectively. Both
release-update and entry-invalidate protocols provide a weak consistency memory model on
copies of a paragraph. The weak consistency memory model differs from the strict
consistency memory model in that it does not guarantee memory coherency without the use of
explicit high-level synchronization operations. Parallel programs, therefore, need to
impose an ordering on accesses to shared memory by using synchronization operations. This
model treats shared memory accesses differently from synchronization variable accesses,
and supports two types of synchronization accesses: acquire and release. Similar to the
software release consistent protocol used in [8], the release-update protocol ensures that all
previously modified data is updated before the release is performed on a synchronization
variable. Similar to the entry consistent protocol used in [10], the entry-invalidate protocol
ensures that consistent copies of paragraphs are pre-fetched when the acquire or entry of a
synchronization variable is performed.

DICE DSM is similar to the Munin[8] system, since both use multiple type-specific
coherency protocols. However, the kinds of protocols supported and their designs differ
between the two. The more significant differences are in memory structure and
granularity. DICE DSM uses a flat memory space of fixed-sized paragraphs, while Munin uses
a memory space structured as variable-sized objects. The advantage of using fixed-sized paragraphs
is that it allows the DSM to be implemented in hardware, like MemNet [11]. This will improve
performance significantly, and is the goal for the final prototype of DICE DSM.

DICE separates the synchronization mechanism from shared memory. It supports two kinds of
synchronization variables: locks and barriers. Whereas locks are used primarily for access
control, that is, to resolve competition among parallel threads, barriers are used for sequence 
control, that is, to ensure correct timing among cooperating threads. Other kinds of 
synchronization variables can be built on top of them. DICE uses distributed queueing 
schemes for both lock and barrier synchronization protocols. 
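The difference between the two kinds of synchronization variables can be illustrated with a small single-machine sketch. It uses Python's standard `threading.Lock` and `threading.Barrier` as stand-ins; it does not model DICE's distributed queueing schemes, and the function name `run` and its parameters are assumptions made for this example.

```python
import threading


def run(num_threads=4, increments=1000):
    """Each thread updates a shared counter, then all threads meet at a barrier."""
    counter = 0
    lock = threading.Lock()                    # access control
    barrier = threading.Barrier(num_threads)   # sequence control
    seen_after_barrier = []

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:            # only one thread updates the counter at a time
                counter += 1
        barrier.wait()            # no thread continues until all have finished updating
        with lock:
            seen_after_barrier.append(counter)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, seen_after_barrier
```

The lock resolves competition for the counter, while the barrier guarantees that every thread observes the final value only after all updates are complete.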

4. Parallel Scheduler 

DICE PS is a self-optimizing, application-specific scheduler. It is responsible for thread
scheduling and synchronization. The PS is implemented as a thread within the parallel task.
Each parallel task has one PS running on the workstation where the task initially starts to run.
This special thread is created at application load time.

When an application needs to create another thread or to terminate itself by joining with 
other threads, it passes control of the execution to the PS. The PS will find the fastest way to 
run the application by using the information in the task execution dependence tree, which is 
created as an auxiliary file during the compiling of the source program. 

The PS decides whether the local workstation has enough resources to run the different 
threads, which threads to send to remote workstations to run, and which remote workstations 
to send them to. It uses several tools to make intelligent decisions at run time. Those tools 
are: CPU load estimator, network load estimator, an intelligent database, and the bidding 
process. 

The CPU load estimator runs on every workstation on the network and keeps track of the 
load on that workstation. The network load estimator monitors the traffic on the network, 
and helps the parallel scheduler in avoiding heavily loaded networks. A small and efficient 
database records thread performance on each workstation under different CPU and
network load conditions. This database helps the bidding process by giving the workstations a 
reasonable estimate of the expected run times of various threads. 

When the parallel scheduler decides that it is best to send some threads to remote
workstations to run, it needs a way to pick those workstations. Instead of forcing other,
possibly heavily loaded, workstations to take some of the threads, the parallel scheduler asks
for help through the bidding process. It simply asks for help in running a given thread and tells
the other workstations about the memory and CPU requirements of the thread. This
information is found in the intelligent database. The detailed design of PS is based on our
previous work [12].
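The bidding process might be sketched as follows. This is only an illustration under assumed rules: the report does not define the bid format, so here a bid is an estimated run time, a workstation declines when it lacks memory or is heavily loaded, and the lowest bid wins. The class `Workstation`, the 0.9 load threshold, and the run-time estimate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workstation:
    name: str
    free_memory_mb: int
    cpu_load: float   # 0.0 (idle) up to 1.0 (saturated)

    def bid(self, memory_mb, base_runtime_s):
        """Return an estimated run time in seconds, or None to decline."""
        if self.free_memory_mb < memory_mb or self.cpu_load >= 0.9:
            return None   # assumed rule: short on memory or heavily loaded
        # Assumed estimate: run time stretches as the CPU gets busier.
        return base_runtime_s / (1.0 - self.cpu_load)


def choose_workstation(stations, memory_mb, base_runtime_s):
    """Announce the thread's requirements and pick the lowest bid, if any."""
    bids = [(w.bid(memory_mb, base_runtime_s), w.name) for w in stations]
    valid = [(t, name) for t, name in bids if t is not None]
    return min(valid)[1] if valid else None
```

Because workstations volunteer rather than being assigned work, a loaded machine can stay out of the competition simply by not bidding.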

5. Distributed Run-Time Subsystem 

DRS transforms the DICE DSM from a flat space into an object-oriented structured space.
DICE DRS consists of a set of tools that implement the DICE Application Programmer's
Interface (API), which provides users with programming tools to develop and execute DICE
multi-threaded applications. The tools used during program development include a parallel
language and its compiler, library interface functions, and a linker.

A new Object-Oriented Dataflow Language (OODL) will be designed as the parallel
language used in DICE. Two important features of object-oriented programming are
information hiding and encapsulation [13,14]. They provide a higher level of data abstraction in
modeling real world objects. Such concepts are helpful in designing parallel programs [13]. In
general, parallel programs are difficult to design because the programmer must consider
multiple execution threads instead of a single thread. All possible interactions among the
threads must be considered. Also, parallel programs are hard to maintain because a simple
change may affect the interaction pattern and result in global consequences. Information
hiding helps in reducing the possible interactions that need to be considered, while data
encapsulation helps in minimizing the maintenance effort when program changes are needed.

While the object-oriented model provides a high level of programming abstraction, it does 
not naturally exploit parallelism of applications constructed with objects. A dataflow model 
can expose and exploit the maximum amount of parallelism, as well as express data 
dependence from different levels of abstraction in a very natural way. The combination of the 
object oriented and dataflow concepts makes it easier for programmers to design large scale 
multi-threaded parallel programs, and to build re-usable concurrent software modules. 

The OODL language in DICE will be an extension of the object-oriented programming
language C++. Dataflow constructs will be added to allow programmers to express
parallelism explicitly. The parallel compiler can be realized using a preprocessor to translate
the extended source code into C++ programs, which in turn are compiled into object code
using an existing C++ compiler.

The run-time library interface functions provide a collection of library routines that are
linked with each parallel program. They are invoked to support the service requests made by
system processes at run-time. The OODL compiler will use these functions to realize the
parallelism expressed in the application programs. These functions can also be used by the
application directly.


6. Conclusions 

The key results accomplished in this project include: 

1. A design of a distributed shared memory system for distributed networked computing that
solves the problem of false sharing. The DSM employs a two-tier paging scheme and a
set of management protocols and algorithms suitable for hardware support within the
architecture of a workstation.

2. The DSM scheme was evaluated analytically. The results verify the benefit of
the two-tier paging scheme in solving the problem of false sharing.

3. The DSM was also simulated using the Block Oriented Network Simulator, BONeS, and
was driven by a trace from a scientific application chosen from Stanford's SPLASH
benchmarks. The results of the simulation confirmed the results of the analytical work
and also verified the utility of the two-tier paging scheme.

The papers attached to this summary report contain further details of the work performed
under this project.
Coherency Protocol and Algorithm of the DICE Distributed Shared Memory

Abstract

DICE (Distributed Integrated Computing Environment)
DSM (Distributed Shared Memory) is an experimental system,
being developed at Santa Clara University, which supports
the execution of multiple threads on a cluster of networked
workstations. This paper presents the coherency
protocol and algorithm of DICE DSM, which is a novel
approach to the design of virtual-memory based DSM. In
DICE DSM, the shared memory uses a two-tier paging system.
The first tier, page, is the common page used in an operating
system. The second tier is called a paragraph, which is a
smaller fixed-sized unit of memory contained within a page.
The introduction of paragraphs improves system performance
by reducing the probability of false sharing as well as the size
of the unit of information transferred over the network for
maintenance of memory coherency.

Keywords: coherency protocol and algorithm, distributed
shared memory, local area network.

1. Introduction

A Distributed Shared Memory (DSM) system supports
the sharing of a virtual address space among processes running
on loosely-coupled processors. A number of DSM systems
over LANs have been developed [8]. Among them, Ivy
[5] is implemented on a network of Apollo workstations. The
memory is paged, and copies of pages may be replicated at
different hosts. Strict coherency semantics are used, and the
memory coherency is maintained by a write-invalidate with
dynamic ownership protocol. The owner of a page is located
via either a centralized manager, fixed distributed managers,
or an individual host which forwards the request. Ivy is used
for applications employing multi-threaded tasks. All threads
share the same virtual address space. False sharing may occur
in this system, since its consistency or access unit (e.g., a word)
is smaller than the sharing unit (a page). In addition, the
single-write nature of its protocol may cause a "ping-pong" behavior
between multiple writers of a shared page.

To overcome false sharing and thrashing, some systems
employ special schemes. Clouds [7] avoids them by using
single-writer-single-reader strict coherence semantics.
Mirage [3] reduces thrashing by using a time window scheme,
in which the system guarantees that the writer of a page retains
access to the page for a fixed period of time. Munin [2] handles it
by using multiple consistency protocols and software release
consistency. Mether [6] reduces false sharing and thrashing
through the use of incoherent memory.

DICE (Distributed Integrated Computing Environment) [1]
presents a novel approach to handle the problem of
false sharing and thrashing. The shared memory is structured
as a two-tier paging system. The first tier, called page, is the
page commonly used in an operating system. The second tier
is called a paragraph, which is a smaller fixed-sized block of
memory within a page. The paragraph is the coherency unit. The
introduction of paragraphs improves system performance by
reducing the probability of false sharing as well as the size of
the unit of information transferred over the network for
maintenance of memory coherency.

An overview of the DICE DSM architecture is given in
section 2. Section 3 presents the memory coherency protocol.
The algorithm for realizing the complete DSM protocol is
presented in section 4. Section 5 discusses the expected
system performance and concludes.

This work was supported by NASA-Ames Research Center grant
number NCC 2-644, entitled "Parallel Processing for Scientific
Computations".


2. The DICE DSM Architecture 

DICE is an experimental distributed computing system
which aims at providing a computing environment for the
execution of multi-threaded tasks. A parallel task may consist
of multiple threads that can be scheduled to run on a cluster of
workstations simultaneously. A thread is an active program
entity that provides the notion of a computation. Threads on
separate workstations also share the same virtual address
space, and communicate with each other using shared
memory. Synchronization of threads accessing shared
resources is done using functions provided by a distributed
run-time library.

Figure 1 shows the system structure of DICE. It consists
of three interactive subsystems. DRS (distributed run-time
subsystem) provides users with programming tools to develop
and execute DICE multi-threaded applications. DSM
(distributed shared memory) provides the underlying
communication and computing paradigm for threads of a parallel task.
PS (parallel scheduler) is a self-optimizing application-specific
scheduler, and is responsible for thread scheduling and
synchronization.

In addition to a host processor and memory, each node in
DICE also has a network processor and a Distributed Shared
Memory Management Unit (DSMMU). DSMMU is an extension
of the traditional MMU which supports paragraph
validation/invalidation to achieve efficient management of the
DSM. When data is not available locally and needs to be
fetched from a remote host, the DSMMU triggers a special
access fault; otherwise, the DSMMU performs the traditional
TLB operations.

3. Coherency Protocol

In DICE, a parallel task consists of multiple threads that
run on a cluster of workstations (hosts) simultaneously.
Shared data can be distributed and replicated on the physical
memory of the members of a cluster. The DSM system
supports the sharing of virtual pages, and maintains coherency
among replicated data copies across the network. A parallel
task has a root host, on which it was first loaded and executed.
The root host maintains the state information for all shared
pages used by the task. Other hosts in the cluster maintain the
state information for the shared pages that are currently in
their local physical memories.

In DICE, each shared page of a parallel task has a home
host. A home host maintains the state information for its
pages, ensures that the last copy of a page is not purged,
and keeps track of all copies of the paragraphs of its pages.
Other hosts in the cluster that have a copy of a page keep a
pointer to the home host. When a thread attempts to
access a page for which it does not have a copy, it
communicates with the home host of the respective page in order to
complete the memory access transaction. When a host does
not know the home host for a certain page, a home-info fault
will be triggered and a home-info request will be sent to the
root host. The root host replies with the information about the
home host for the requested page. If the home host is not yet
assigned, the root host will assign the first requesting host as
the home host for the requested page. The root host will then
update its database and send the requesting host a reply
informing it of this assignment.
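The root host's handling of a home-info request can be sketched in a few lines. The class and method names below are hypothetical, and network messaging is reduced to a plain method call; the only behavior taken from the text is that the first host to ask about an unassigned page becomes its home host.

```python
class RootHost:
    """Hypothetical root host keeping the page-to-home-host database."""

    def __init__(self):
        self.home_of = {}   # page number -> home host identifier

    def handle_home_info_request(self, page, requesting_host):
        # setdefault stores the requester only when the page has no home host
        # yet; otherwise the existing assignment is returned unchanged.
        return self.home_of.setdefault(page, requesting_host)
```

For example, if host B asks about page 7 first and host C asks later, both receive B as the home host of page 7.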

The memory coherency of DICE DSM is maintained at
the paragraph level. A paragraph can be read simultaneously
by multiple hosts, but it can only be written by one host at a
time. Access rights to a paragraph can be read-write,
read-only, or none. An owner host is the most recent host that has
read-write access to that paragraph. The ownership of a
paragraph may be transferred from one host to another. There is
no ownership when two or more hosts have read-only access
rights to that paragraph. The information about the ownership
of a paragraph is maintained at the home host of the page
containing the paragraph.

When a read operation is issued to a paragraph by a host
with none access right, a read-data fault will be triggered and a
read-data request will be sent to the paragraph's home host.
When a write operation is issued to a paragraph by a host with
none access right, a write-data fault will be triggered and a
write-data request will be sent to the paragraph's home host. When a
write operation is issued to a paragraph by a host with
read-only access right, a write-access fault will be triggered and a
write-access request will be sent to the paragraph's home
host. In each case the home host directly or indirectly
responds with the requested information.

At initialization, a home host is the default owner host for
all paragraphs within its respective pages. Any other host will
send a remote request to the home host when it tries to access a
paragraph of this page. If a read-data request is received, the
home host will return a reply containing the most recent copy
of the desired paragraph when it is the owner host or there is no
owner host of that paragraph. The access rights of both home
and requesting hosts are changed to read-only. If the home
host is not the owner host, it will forward this read-data
request to the owner host of that paragraph. The latter changes
its access right to read-only, and then sends to both home and
requesting hosts a reply containing the most recent copy of
that paragraph. After they receive the reply, both home and
requesting hosts change their access rights to read-only. The home
host will also reset the owner host of that paragraph to none. If
the home host is itself the requesting host, it will send the
read-data request directly to the owner host. The latter changes its
access right to read-only, and then sends back a reply which
contains the most recent copy of that paragraph. Having received
this reply, the home host changes its access right to read-only
and resets the owner host of that paragraph to none.
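The bookkeeping performed at the home host for a read-data request from another host can be sketched as below. This is a simplified model under stated assumptions: messages are not simulated, the home host's id is the hypothetical constant `HOME`, and the field names `acc`, `oid`, and `copyset` are borrowed from the tables described in section 4. The copy set recorded in the forwarded case (old owner plus requester) is an inference from the prose, not something the text states. The case where the home host is itself the requester is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

HOME = "home"   # hypothetical identifier of the home host


@dataclass
class HomeEntry:
    """Home host's record for one paragraph."""
    acc: str = "read-write"             # home host's own access right
    oid: Optional[str] = HOME           # owner host; None means no owner
    copyset: Set[str] = field(default_factory=set)  # read-only holders, home excluded


def serve_read_data(entry: HomeEntry, requester: str) -> str:
    """Update the home record and return the host that supplies the copy."""
    if entry.oid is None:
        # No owner: the home host already holds a valid read-only copy.
        entry.copyset.add(requester)
        return HOME
    if entry.oid == HOME:
        # The home host is the owner: it downgrades itself and replies.
        entry.acc = "read-only"
        entry.copyset = {requester}
        entry.oid = None
        return HOME
    # The owner is a third host: the request is forwarded to it. The owner
    # downgrades to read-only and replies to both the home host and the requester.
    supplier = entry.oid
    entry.acc = "read-only"
    entry.copyset = {supplier, requester}   # inferred, see the note above
    entry.oid = None
    return supplier
```

In every branch the paragraph ends with no owner and with read-only copies at the home host and the requester, matching the prose above.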

If a write-data request is received, the home host will
return a reply containing the most recent copy of the desired
paragraph if it is the owner of this paragraph. If multiple valid
copies exist, the home host will send invalidate requests to all
hosts holding the copies, and wait for confirmations from all
of them before returning the reply. Upon receiving the
invalidate request, each host changes its access right to that
paragraph to none and returns its confirmation to the home host.
The access right of the home host is then changed to none,
while the requesting host becomes the owner host and its
access right is changed to read-write. If the home host is not the
owner host, it will forward this write-data request to the
owner host of that paragraph. The latter changes its access right to
none, and then directly sends to the requesting host a reply
containing the most recent copy of that paragraph. After
receiving the reply, the requesting host changes its access right
to read-write and sends a confirmation message to the home
host. Having received this confirmation message, the home
host updates its database and records that the requesting host
has become the owner host of that paragraph. If the home host is
the requesting host, it will directly send the write-data request
to the owner host. The latter changes its access right to none,
and then sends back a reply which contains the most recent
copy of that paragraph. Having received this reply, the home
host changes its access right to read-write and becomes the
owner host of that paragraph.

If a write-access request is received, the home host will
return the write-access confirmation when it is the owner of
that paragraph. If multiple valid copies exist, the home host
will send invalidate requests to all hosts (except the
requesting one) holding the copies, and wait for confirmations from
all of them before returning the confirmation message. Upon
receiving the invalidate request, each host changes its access
right to that paragraph to none and returns its confirmation to
the home host. The access right of the home host is then
changed to none, while the requesting host becomes the owner
host and its access right is changed to read-write. If the home
host is the requesting host, it will directly send the invalidate
requests to all hosts (except the requesting one) holding the
copies and wait for confirmations from all of them. Upon
receiving the invalidate request, each host changes its access
right to none and returns its confirmation to the home host.
The home host then changes its access right to read-write and
becomes the owner host of that paragraph.

Figure 2 shows the state diagram representing the
location of a valid paragraph. This state diagram reflects the
protocol described above. At any time, the location state of a
valid paragraph is either none, at home host, at owner host, or at
multiple hosts. The state is initially set to none when a home
host has not yet been assigned. A home-info fault and request
made by any host forces the root host to assign the requesting
host to become the home host. The state is then changed to at
home host. In this case the home host is the owner of the
paragraph.

A paragraph will leave the at home host state when either
a read-data or a write-data fault occurs. A read-data fault
and request at any non-home host causes the paragraph to
transit to the at multiple hosts state. In this case there is no
owner host and multiple hosts have valid copies (with
read-only access rights) of the paragraph. Note that these multiple
hosts always include the home host. A write-data fault and
request causes the paragraph to transit to the at owner host
state, where the requesting host becomes the owner host of the
paragraph.

The paragraph will leave the at multiple hosts state when
either a write-access or a write-data fault occurs. A
write-access or a write-data fault and request at any other
non-home host causes the paragraph to transit to the at owner host
state. A write-access fault and request at the home host
causes the paragraph to transit to the at home host state. A
read-data fault and request at any other non-home host will
still keep the paragraph in the at multiple hosts state. Note that
a read-data or a write-data fault will never occur at the home
host, since a home host has a valid copy of the paragraph (with
read-only access rights) in the at multiple hosts state.

The paragraph may leave the at owner host state when
either a read-data or a write-data fault occurs. A read-data
fault and request at any other host causes the paragraph to
transit to the at multiple hosts state. A write-data fault and
request at the home host causes the paragraph to transit to the
at home host state. A write-data fault and request at any other
non-home host causes a change of ownership, but the
paragraph will still be in the at owner host state.
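The location-state transitions described above can be collected into a lookup table. The sketch below is one possible encoding, not the authors' implementation: the key layout (state, fault type, whether the fault was raised at the home host) and the function name `next_state` are assumptions made for this example.

```python
NONE = "none"
AT_HOME = "at home host"
AT_OWNER = "at owner host"
AT_MULTIPLE = "at multiple hosts"

# (current state, fault type, fault raised at the home host?) -> next state
TRANSITIONS = {
    (NONE, "home-info", False): AT_HOME,         # requester is assigned as home host
    (AT_HOME, "read-data", False): AT_MULTIPLE,
    (AT_HOME, "write-data", False): AT_OWNER,
    (AT_MULTIPLE, "read-data", False): AT_MULTIPLE,
    (AT_MULTIPLE, "write-access", False): AT_OWNER,
    (AT_MULTIPLE, "write-data", False): AT_OWNER,
    (AT_MULTIPLE, "write-access", True): AT_HOME,
    (AT_OWNER, "read-data", False): AT_MULTIPLE,
    (AT_OWNER, "read-data", True): AT_MULTIPLE,
    (AT_OWNER, "write-data", False): AT_OWNER,   # ownership changes hands
    (AT_OWNER, "write-data", True): AT_HOME,
}


def next_state(state, fault, at_home=False):
    """Return the next location state, or raise ValueError if the fault cannot occur."""
    try:
        return TRANSITIONS[(state, fault, at_home)]
    except KeyError:
        raise ValueError(f"no transition for {fault!r} in state {state!r}") from None
```

Combinations the text rules out, such as a read-data fault at the home host in the at multiple hosts state, are simply absent from the table and raise an error.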

4. Coherency Algorithm

To support the above protocol, a page table (PT) and a
paragraph table (ParT) are used to maintain the state
information about shared pages and paragraphs. Each DICE
application maintains its own set of these tables. In addition to the
address mapping information and flags, PT also maintains the
information about the location of the home host of each shared
page. This location information is denoted by the home host
identifier, or hid. ParT maintains the information about the
access rights to each paragraph (acc). The ParT of the home
host also maintains the location of the owner host of a
paragraph (oid), and the set of hosts (excluding the home host)
which have read-only copies of the paragraph (copyset).
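A minimal sketch of these two tables, using the field names hid, acc, oid, and copyset from the text, is shown below. The class names, the use of dictionaries, the assumption that a paragraph number identifies a paragraph across the whole shared space, and the helper `fault_for` (which maps an access to the fault types of section 3) are illustrative assumptions; address mapping information and flags are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class PageEntry:
    hid: Optional[str] = None   # home host identifier; None until learned via home-info


@dataclass
class ParagraphEntry:
    acc: str = "none"           # "read-write", "read-only", or "none"
    # The next two fields are meaningful only in the home host's ParT.
    oid: Optional[str] = None   # owner host, or None when there is no owner
    copyset: Set[str] = field(default_factory=set)  # read-only holders, home excluded


@dataclass
class HostTables:
    host_id: str
    PT: Dict[int, PageEntry] = field(default_factory=dict)
    ParT: Dict[int, ParagraphEntry] = field(default_factory=dict)

    def fault_for(self, p: int, g: int, is_write: bool) -> Optional[str]:
        """Name the fault, if any, raised by an access to paragraph g of page p."""
        if self.PT.get(p, PageEntry()).hid is None:
            return "home-info"       # home host of the page is not yet known
        acc = self.ParT.get(g, ParagraphEntry()).acc
        if acc == "read-write" or (acc == "read-only" and not is_write):
            return None              # access is permitted locally
        if acc == "read-only":
            return "write-access"    # valid copy held, write permission missing
        return "write-data" if is_write else "read-data"
```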

The coherency algorithm handles various kinds of
paragraph validation faults as described in section 3. These faults
include home-info, read-data, write-data, and write-access
faults. We divide the algorithm into four parts, corresponding
to the four fault types. Each part of this algorithm consists of a
fault handler and its server, as illustrated in Figures 3 to 6 for
the respective fault type. Note that p and g, which are used
within the algorithm, denote the current page and paragraph
numbers, respectively.



5. Discussions and Conclusions

We have presented the memory coherency protocol and
algorithm of DICE DSM. The coherency protocol for this
two-tier paging system is now being simulated in software.
The performance of the DICE DSM system has been studied
using an analytical model [4], which derives an expression for
the speedup of the parallel part of an application (or S_p). In
this analysis, a high-speed, low-latency ATM LAN is
chosen as the underlying platform, and the queueing time on the
network is assumed to be negligible. The memory access unit is
assumed to be four bytes (or one word). Each page has P bytes
and k paragraphs per page. An application is executed by N
hosts, and uses M bytes of shared memory space. The
behavior of an application is represented by the percentage of data
memory accesses among total instructions (denoted by i); the
probabilities of read and write faults (denoted by and iV^,
which are the number of read faults and write faults per
1,000,000 memory references per host); temporal locality
(denoted by x r , which is the number of times that the same
paragraph is accessed continuously by a host); and spatial
locality factor (denoted by x t , which is the probability of a
certain region of shared memory being accessed by a specific 
host). The temporal locality x t is further represented by a 
step uniform distribution (with parameters , N x , and g t 
which are the starting pointer, ending pointer, and window siz- 


R«ad_dsta paragraph fault handler* 
if* (I am homa host) 

BEGIN * /* ‘at owomr nosC s» to -> 'al mjittpf hosts' $ui« */ 

sand r»ad_data raquast to Par pg]. art: 
raeawa read^data reefy from ParTfgj.oitf: 

Par TtgJ.cooyaai - {ParTTgJ.od); 

ParT[gj.o»d » none: 

ENO 

ELSE 

BEGIN 

sand raad^data request to PTfp} Jiid: 
racanra read data reciy from o v*w hose 

ENO; 

updata local copy of g; 

RarTfgj.acc * read-only; 
undock host processor 

return; 

Read_cfeta paragraph fault server 
IF (1 am home host) 

BEGIN r ‘at mritioi* hosts" state - no state change V 
IF (no owner host) 

BEGIN 

sand readjusts reply to requesting hose 
ParTXgJ.cooysat-FarTtgl.copyaai+^raquasnng host); 

ENO 

ELSE IF (1 am owner host) 

BEGIN r u 3t home host* state -> ‘ at mmto* hosts? state •/ 

ParTfgJ.acc - read-only; 
sand read jaara reply to requesting host; 

Farr(gJ.cooysat - {requesting host); 

Par Tig J. okJ - none; 

ENO 

ELSE r owner host is not me •/ 

BEGIN r ‘at owner host* state -^‘it muttip* hosts' state */ 

forward reed_data request to ParTTgj.OKfc 
block processsrg future requests for g; 
reoavwe rmmd catM reoN from owner host; 
update local copy of g; 

ParTtgJ.aoc - reed— only; 

ParTtgj.copysat - {Part(gJ.ouirequesang host>; 

Part(g|.oui - nonet 

unblock processing future r eq u es ts for g; 

ENO 

010 

ELSE r t am owner host but not home host */ 

BEGIN 

Part(gJ.*ce. - raad— only; 

IF (raqueatina host Is home host) 

sand reed date reply to home hose 

ELSE 

send reed data reply to both home and requesting hosts; 

ENO 

re&jnr 


Bgurae. The algorithm for handling readjusts paragraph faults 


e of this step), which approximates the bell— like normal dis- 
tribution reflecting intuition that the chance of a memory 
location being accessed by a host decreases as the distance 
grows from the previously accessed location. 
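
As a rough illustration of the home-host branch of the read_data fault server, the Python sketch below applies the three state transitions to a single ParT entry. Message passing is omitted entirely, and the names ParTEntry and serve_read_data_at_home, as well as the use of integer host ids, are assumptions made for this sketch rather than part of the DICE design.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ParTEntry:
    """Home-side paragraph table entry; field names follow the text (acc, oid, copyset)."""
    acc: str = "none"                     # home's own rights: none / read-only / read-write
    oid: Optional[int] = None             # owner host id, or None when there is no owner
    copyset: Set[int] = field(default_factory=set)  # read-only holders, excluding home


def serve_read_data_at_home(entry: ParTEntry, home: int, requester: int) -> str:
    """Update the home's ParT entry for a read_data request.

    Returns a label naming which host supplies the data (hypothetical return
    convention; a real server would send network messages instead).
    """
    if entry.oid is None:
        # 'at multiple hosts' state: home already holds a valid copy, no state change.
        entry.copyset.add(requester)
        return "home"
    if entry.oid == home:
        # 'at home host' -> 'at multiple hosts': home gives up write access.
        entry.acc = "read-only"
        entry.copyset = {requester}
        entry.oid = None
        return "home"
    # 'at owner host' -> 'at multiple hosts': the request is forwarded to the
    # remote owner, which replies to both the home and the requester.
    old_owner = entry.oid
    entry.acc = "read-only"
    entry.copyset = {old_owner, requester}
    entry.oid = None
    return f"owner {old_owner}"
```

In each branch the entry ends with no owner, which is what allows several hosts to hold read-only copies at once.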

The effects of system structure and application behavior on
$S_p$ have been studied, and some of these results are shown
below. Figure 7 shows that the gain in $S_p$ becomes smaller
and smaller as the network data rate $R_n$ increases. This may
justify the above assumption that the queuing time on the
network is negligible in a high-speed and low-latency network.
Figure 8 shows that $S_p$ decreases as processor speed $R_p$
increases. Note that the total execution time for an application
will still be reduced as $R_p$ increases, although $S_p$ decreases.

Figure 9 shows that $S_p$ increases as the number of paragraphs
per page, k, increases up to a certain point. After that
point, $S_p$ slightly decreases as k further increases. Furthermore,
$S_p$ is approximately the same for a fixed paragraph
size, which is P/k. This behavior demonstrates the usefulness of
a paragraph with a smaller granularity than a page.
Figure 10 shows a similar behavior for $S_p$ in relation to
the number of hosts N.

The analysis of this performance model demonstrates the
effect of using a paragraph, which has a smaller granularity than
a page. This smaller granularity reduces the probability of
false sharing and the amount of data to be transferred over the
network. The performance of DICE DSM is also going to be
evaluated by a trace-driven simulation model, which will
take network queuing delay into account and give more
realistic results.

The concept of using a paragraph is different from that of
using a cache line or from simply using a small page size.
Cache-based DSM has been used in multiprocessor systems,
which need to build their own interconnection network interface
and use their own message-based communication scheme. In
contrast, paragraph-based or page-based DSM is used in systems
over LANs, using the existing network interface with standard
packet-based or cell-based network communication protocols. As
compared with a small page size, the paragraph reduces the
complexity of shared memory management, due to the small size
of the page table and the two-layered hierarchical
page/paragraph structure, while allowing a host to continue
using the larger page size that is the trend in current memory
design in uniprocessor computer systems. This reduction of
complexity is also due to the use of home hosts in the protocol,
which makes it easy to locate the desired memory unit while
distributing the management of shared memory over the hosts on
a LAN.

Write_access paragraph fault handler:
IF (I am home host)
    BEGIN  /* 'at multiple hosts' state -> 'at home host' state */
        send invalidation request to all hosts in ParT[g].copyset;
        receive all invalidation confirmations;
        ParT[g].copyset = {};
        ParT[g].oid = myself;
    END
ELSE
    BEGIN
        send write_access request to PT[p].hid;
        receive access confirmation from PT[p].hid;
    END;
ParT[g].acc = read-write;
unblock host processor;
return;

Write_access paragraph fault server:
/* 'at multiple hosts' state -> 'at owner host' state */
send invalidation request to all hosts (except requesting host) in ParT[g].copyset;
receive all invalidation confirmations;
ParT[g].copyset = {};
ParT[g].acc = none;
send access confirmation to requesting host;
ParT[g].oid = requesting host;
return;

Invalidation server:
ParT[g].acc = none;
send invalidation confirmation to home host;
A Two-Tier Paging Scheme for Network-based Distributed Shared Memory Systems

Chi-Jiunn Jou, Hasan S. AlKhatib, and Qiang Li 

Abstract - Distributed computing over a network of workstations continues to be an elusive goal. Its
main obstacle is the delay penalty due to network protocol and OS overhead. We present in this paper a
low-level, hardware-supported scheme for managing distributed shared memory (DSM), as an underlying paradigm
for distributed computing. The proposed DSM is novel in that it employs a two-tier paging scheme that
reduces the probability of false sharing and facilitates an efficient hardware implementation. The scheme
employs a standard OS page and divides it into fixed smaller memory units called paragraphs, similar to cache
lines.

An application address space is viewed as consisting of a shared data region, an unshared data region, a
stack region and a code region. Code, stack and unshared data regions are handled by the OS in the standard
manner without modification. The proposed scheme manages the shared data regions only. A hardware
extension of a traditional MMU, the Distributed MMU or DMMU, is introduced to support the DSM. Shared memory
coherency is maintained through a write-invalidate protocol. An analytical model is built to evaluate the
system sensitivity to various parameters and to assess its performance.

Keywords - distributed shared memory; false sharing; hardware support for distributed computing; 
memory coherency protocol; performance evaluation; networks of workstations. 

1. Introduction 

Despite the tremendous progress made in local area networking over the past decade and a half, the
operating system and network protocol technologies have yet to address the main obstacle to distributed
computing, namely the delay due to the network overhead. Network speed has reached several
hundred Mbps, but the real issue is the network overhead latency in addition to sustained throughput.


*This work was supported by NASA-Ames Research Center grant number NCC 2-644, entitled "Parallel Processing for
Scientific Computations".


The problem consists of a myriad of sub-problems, and is not simple to resolve. It requires system-wide
consideration of the full integration of networks into the operating system, and a re-examination of network
protocols and the overall system architecture, including hardware support for both network protocols and the
OS. This integrated view is being pursued in a project at Santa Clara University called DICE, a Distributed
Integrated Computing Environment [1]. DICE supports a distributed shared memory (DSM) paradigm. This paper
presents the design and performance of DICE DSM.

A number of DSM systems based on LANs have been developed over the past decade [18]. Among them,
Ivy [13] is implemented on a network of Apollo workstations. The memory is paged, and copies of pages may
be replicated in different hosts. A multiple-readers-and-single-writer strict coherency semantics is used on the
page level. Memory coherency is maintained via a dynamic ownership protocol with a write-invalidate procedure.
The owner of a page is located using either a centralized manager, a group of fixed distributed managers,
or the individual host which forwards the request. Ivy is designed for multi-threaded applications. All threads
share the same virtual address space. False sharing may occur in this system, since its consistency or access
unit (e.g. word) is smaller than the sharing unit (page). In addition, the single-writer nature of its protocol may
cause a "ping-pong" behavior between multiple writers of a shared page, leading to thrashing.

The problems of false sharing and thrashing have been addressed by other DSM systems. Clouds [15]
avoids them by using a single-writer-single-reader strict coherence semantics, introducing instead significant
blocking delays. Mirage [9] reduces thrashing by using a time window scheme, in which the system guarantees
that the writer of a page retains access to the page for a fixed period of time, suffering again from blocking
delays. Munin [3] handles them by using multiple consistency protocols and software release consistency, hence
placing the burden on the user. Mether [14] eliminates false sharing and thrashing by ignoring memory coherency
altogether, leaving its burden to the application software.

DICE represents a novel approach to handling the problem of false sharing and thrashing. The shared portion
of memory is structured as a two-tier paging system. The first tier is a normal page, and the second is
called a paragraph, which is a smaller fixed-size block of memory within a page. Coherency is maintained at
the level of a paragraph. The introduction of paragraphs improves system performance by reducing the probability
of false sharing as well as the size of the unit of information transferred over the network for maintenance
of memory coherency. A Distributed Memory Management Unit, DMMU, an extension of the traditional
MMU, is designed to support the paragraph validation, and a special network controller is used to
support the accesses to the remote memory and the maintenance of memory coherence.

Section 2 of this paper gives the overview of the DICE architecture. The design of the DICE distributed
shared memory is described in section 3. An analytical model and the expected system performance are
presented and discussed in section 4. Section 5 concludes this work and compares it to other approaches.

2. Overview of the DICE Architecture 

DICE is an experimental distributed environment for executing multi-threaded tasks. A parallel task may
consist of multiple threads that can be scheduled to run simultaneously on a cluster of workstations. Threads
executing on separate workstations share the same virtual address space, and communicate with each other
using shared memory. Synchronization of threads accessing shared resources is done using functions provided
by a distributed run-time library.

DICE consists of three interactive subsystems. The DSM provides the underlying communication paradigm
among threads of a parallel task. The DRS (distributed run-time subsystem) provides users with programming
tools to develop and execute DICE multi-threaded applications. The PS (parallel scheduler) is a
self-optimizing application-specific scheduler, and is responsible for thread scheduling and synchronization.

3. Design Issues of the DICE DSM 

DICE DSM is designed for a cluster of workstations connected via a high-speed, low-latency local area
network. The architecture of a node in a DICE system is shown in Figure 1. Each node consists of a host
processor and a physical memory module. The traditional MMU is replaced by a DMMU. The network interface
is attached directly to the memory bus and contains a network processor and a dual-ported memory visible
both to the host and network processors, simultaneously. The dual-ported memory holds data structures for
managing the shared memory.

Figure 1. The Architecture of a DICE Node
(Legend: VA: virtual address; PA: physical address; I/D: instruction & data path)

3.1. Programmer’s View of DICE DSM Environment 

In DICE, a parallel task consists of multiple threads that can run on a cluster of workstations (nodes)
simultaneously. Memory pages required by each thread, whether code or data, are allocated physical memory
blocks at the respective node where the thread is running. Shared data pages are distributed and replicated
among the nodes as needed by the threads. The DSM system is designed to support the sharing of
data pages. The DSM system also maintains the coherency among replicated data copies.

Each parallel task has a root node, on which it was first loaded and executed. The root node maintains
state information for all pages, including shared pages used in the application, while other nodes
maintain the state information for the pages that are loaded in their local systems.

Code and non-shared data pages of a thread are loaded in the physical memory of the node where
the thread is scheduled for execution. Shared data pages, on demand, are first loaded into the physical
memory of the node. That node becomes the home node for the page. A home node maintains the complete
state information for its pages. It ensures that the last copy of a page is not purged, and keeps track of all
copies of paragraphs belonging to its pages. Other nodes in the cluster that have a copy of a shared page keep a
pointer to the page's home node. When a thread makes an attempt to access a page for which it does not have a
copy, it interacts with the home node of that page in order to complete the memory access. When a node does
not know the home for a certain page, a home-info fault is triggered and a home-info request is sent to the root
node. The root node replies with the information about the home node for the requested page. If a home is not
yet assigned for the page, the root node assigns the first requesting node the status of home for that page. The
root node then updates its table and sends the page to the requesting node. The requesting node, upon receiving
the page and the assignment of home status, updates its page table and creates a paragraph map table for that
page.
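
The root node's role in resolving a home-info fault can be pictured with a small Python sketch. It is only illustrative: the class name RootNode, the method name handle_home_info, and the dictionary used as the root's table are assumptions, and the transfer of the page itself is left out.

```python
from typing import Dict, Tuple


class RootNode:
    """Root-node bookkeeping for home assignment (illustrative sketch only)."""

    def __init__(self) -> None:
        # Hypothetical table: page number -> id of the page's home node.
        self.home_of: Dict[int, int] = {}

    def handle_home_info(self, page: int, requester: int) -> Tuple[int, bool]:
        """Answer a home-info request.

        Returns (home node id, newly_assigned). When no home exists yet, the
        first requesting node becomes the home, as described in the text.
        """
        if page in self.home_of:
            return self.home_of[page], False
        self.home_of[page] = requester
        return requester, True


root = RootNode()
print(root.handle_home_info(page=7, requester=2))  # first request: node 2 becomes home
print(root.handle_home_info(page=7, requester=5))  # later request: told that node 2 is home
```

A node that receives newly_assigned as true would then update its page table and create the paragraph table for that page.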

3.2. Coherency Protocol

The memory coherency of DICE DSM is maintained at the paragraph level. A paragraph can simultaneously
be read by multiple nodes, but it can only be written by one node at a time. Access rights to a paragraph
can be read-write, read-only, or none. An owner node of a paragraph is the node that has read-write access to
that paragraph. The ownership of a paragraph may be transferred from one node to another upon demand.
There is no owner for a paragraph when two or more hosts have read-only access rights to that paragraph. The
information about the owner of a paragraph is maintained by the home node of the page containing the
paragraph.

When a read operation is issued to a paragraph by a node with none rights, a read fault is triggered and a
read request is sent to the paragraph's home. When a write operation is issued to a paragraph by a node with
none rights, a write-data fault is triggered and a write-data request is sent to the paragraph's home. When a
write operation is issued to a paragraph by a node with read-only access rights, a write-access fault is triggered
and a write-access request is sent to the paragraph's home. In each case the home directly or indirectly
responds with the requested information. The coherency of paragraphs is basically maintained through a
write-invalidate protocol. The details of this protocol and its algorithm are given in [11].
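
The mapping from local access rights and operation type to the fault that is raised can be summarized in a few lines of Python. This is a sketch of the rules stated above; the function name classify_fault and the string labels are chosen here for illustration.

```python
from typing import Optional

RIGHTS = ("none", "read-only", "read-write")
OPS = ("read", "write")


def classify_fault(rights: str, op: str) -> Optional[str]:
    """Return the paragraph fault raised for this access, or None if it proceeds."""
    if rights not in RIGHTS:
        raise ValueError(f"unknown access rights: {rights}")
    if op not in OPS:
        raise ValueError(f"unknown operation: {op}")
    if rights == "none":
        # No valid local copy: the data itself must be fetched via the home.
        return "read" if op == "read" else "write-data"
    if rights == "read-only" and op == "write":
        # Data is already local; only the right to write must be obtained.
        return "write-access"
    return None  # read on read-only, or any access on read-write


for rights in RIGHTS:
    for op in OPS:
        print(f"{rights:10s} {op:5s} -> {classify_fault(rights, op)}")
```

The distinction between write-data and write-access matters for cost: only the former requires a paragraph-sized data reply.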

3.3. Management of Shared Memory

Page and paragraph tables are used to maintain the state information for shared pages and their paragraphs,
respectively. Each DICE application maintains its own set of these tables. A Page Table (PT), similar to a
traditional page table, provides the information about mapping the virtual addresses of pages to their
corresponding physical addresses at their respective nodes. A Paragraph Validation Table (PVT) maintains the
information about the access rights of the page's paragraphs. Each entry of a PVT contains a 2-bit field
maintaining the access rights of the local node to the respective paragraph. Note that there is no address translation
for paragraphs. Each node keeps a Page Table for Home information (PTH), which maintains the information
about the homes for its shared pages. Each home node of a page maintains a Paragraph Table (ParT) for that
page, containing a pointer to the current owner of each paragraph and a list of nodes with read-only copies of
the paragraph. There is only one ParT for a page in the system. It is maintained by the home node of that page.
The PT and PVT are maintained in the dual-ported memory, inside the LAN interface. They are used by both
host and network processors. The PTH and ParT are maintained in the network subsystem, and are only used
by the network processor. Figure 2 shows the data structures for these tables.

The DMMU is an extension of the traditional MMU. It is designed to support paragraph validation for efficient
handling of distributed shared memory. When data is not available locally and needs to be fetched from a
remote node, the DMMU triggers special access faults via an embedded hardware unit, the PVLB (Paragraph
Validation Lookaside Buffer), which validates the access rights of paragraphs. The DMMU performs the traditional
TLB operations for all non-shared pages as well. When the DMMU does not find the entry it needs in its TLB,
it fetches the entry from the appropriate PT in memory. When an entry is loaded from the PT into the TLB, all
entries of its associated PVT (2 bits per paragraph) are simultaneously fetched and stored into the associated
PVLB. When an entry of the TLB is replaced, all entries of its associated PVLB are also replaced. Note that there
are no PVLBs for non-shared pages.
In dual-ported memory:
    PT entry: pfn (physical page frame number), flags, pPVT (pointer to PVT)
    PVT: k access rights fields, one per paragraph
In network memory:
    a field set to 1 if the local node is home
    ParT (for home only): oid (owner id) and copyset for each paragraph

Figure 2. Page and Paragraph Tables for Shared Pages in DICE


Figure 3 shows the structure of the TLB and the PVLB. Each entry in the TLB contains an address tag, a
physical page frame number, flags, and an S bit. The S bit is used to distinguish shared pages from non-shared
pages. Each TLB entry of a shared page has an associated PVLB, which has k two-bit access rights fields,
where k is the number of paragraphs within a page. The virtual address is grouped into three fields: a page
number, a paragraph number, and a paragraph offset. The page number is used as a key to match the address
tags in the TLB, while the paragraph number directly addresses the PVLB entries corresponding to the same
paragraph number. The latter operation will simultaneously select n PVLB entries, where n is the number of
PVLBs in the DMMU. Each PVLB has an associated logic L, which validates the access rights of the referenced
paragraph. By checking the stored two-bit access rights field and the current memory access type R/W,
logic L generates a Trap signal. The Trap signal is ON when any paragraph validation fault occurs. The Trap
causes a system trap and requires the software to distinguish the type of the current access fault and resolve it.
If there is no Trap, the physical access to the paragraph proceeds without interruption. The function of logic L
is shown in the table inside Figure 3. The S bit of the selected TLB entry is used as a gate to control the final
selection of the Trap signal generated from the previously selected n PVLB entries. Note that the operations on
the PVLB are executed in parallel with the operations on the TLB, except for the final selection of the PVLB
output. Hence, if a memory reference does not generate a paragraph validation Trap, no significant extra delay
will be suffered by going through this additional PVLB unit compared to a traditional MMU.
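
To make the address decomposition and the role of logic L concrete, the following Python sketch splits a virtual address into its three fields and evaluates a Trap condition gated by the S bit. The page and paragraph sizes (4096 and 64 bytes) and the numeric encoding of the two-bit access rights field are assumptions for illustration only; the actual truth table of logic L is the one given in Figure 3.

```python
from typing import Tuple

# Hypothetical encoding of the 2-bit access rights field.
NONE, READ_ONLY, READ_WRITE = 0, 1, 2


def split_address(va: int, page_bits: int, para_bits: int) -> Tuple[int, int, int]:
    """Split a virtual address into (page number, paragraph number, paragraph offset).

    page_bits is log2(page size in bytes); para_bits is log2(paragraph size in bytes).
    """
    offset = va & ((1 << para_bits) - 1)
    paragraph = (va >> para_bits) & ((1 << (page_bits - para_bits)) - 1)
    page = va >> page_bits
    return page, paragraph, offset


def trap(s_bit: bool, rights: int, is_write: bool) -> bool:
    """Trap signal: raised only for shared pages whose rights do not cover the access."""
    if not s_bit:
        return False          # non-shared page: the PVLB output is gated off
    if rights == NONE:
        return True           # any access to an invalid paragraph faults
    return is_write and rights == READ_ONLY


# Assumed sizes: 4096-byte pages (12 bits) and 64-byte paragraphs (6 bits), so k = 64.
page, para, off = split_address(0x12345, page_bits=12, para_bits=6)
print(page, para, off)                                     # 18 13 5
print(trap(s_bit=True, rights=READ_ONLY, is_write=True))   # True
```

Because the paragraph number indexes the PVLB directly, this lookup needs no associative search of its own, which is consistent with the claim that it can run in parallel with the TLB match.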

The control unit of the DMMU contains the logic to manage the retrieval of entries from the PTs and the
PVTs in the dual-ported memory. It also controls the TLB and PVLB update operations, and handles other
related activities. When the retrieval of the entries of the PVT fails, the DMMU triggers a PVT trap, resulting
in a home-info fault as described in section 3.1. Other paragraph validation faults are generated by the PVLB
as described above.

4. Performance Analysis 

The performance of a DICE DSM system is mainly affected by the delays encountered in handling different
paragraph validation faults, which in turn depend on the execution delay of messages sent over the network
to resolve paragraph faults. In the following analysis, a performance metric is first defined. The system
and network model is then presented. Thereafter, the application behavior model and the protocol cost are
described. Finally, the performance results for different combinations of system configurations and application
profiles are shown and discussed.

4.1. Performance Metric 

The performance of parallel systems is often measured in terms of speedup, which is the ratio of the execution
time of a program run on a single processor to that run on a parallel system. We limit ourselves to the
speedup for the parallel part of an application only. We define the speedup for the parallel part of an application,
$S_p$, as the ratio of the execution time of the parallel part of an application running on a single processor to
that running on a DICE DSM system.

Let us denote $T_s$ and $T_{dsm}$ to be the total execution time for the parallel part of an application by a single
node and by N nodes in a DICE DSM system, respectively. Let the processor speed of a single node be denoted
by $R_p$ MIPS. Let the total number of instructions required to be executed in the parallel part of the application
be denoted by $I_a$, and the average rate of shared data memory accesses per instruction be denoted by $d_s$. Then,

$$T_s = \frac{I_a}{R_p} \quad \text{and} \quad T_{dsm} = \frac{I_a}{N}\left(\frac{1}{R_p} + d_s T_{pcost}\right) \tag{1}$$

where $T_{pcost}$ denotes the average protocol cost per shared data memory access, and will be derived in the
following subsections, using an analytical system model. The term $d_s I_a T_{pcost}$ represents the total overhead
when using the DICE DSM. The speedup for the parallel part of an application, $S_p$, is therefore:

$$S_p = \frac{T_s}{T_{dsm}} = \frac{N}{1 + R_p d_s T_{pcost}} \tag{2}$$
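
A short numerical sketch may help show how the speedup depends on the protocol cost. The Python function below evaluates the ratio $T_s / T_{dsm}$ in its simplified form $N / (1 + R_p d_s T_{pcost})$; the choice of microseconds for $T_{pcost}$ (so that it pairs with $R_p$ in MIPS) and the sample parameter values are assumptions, not figures from the analysis.

```python
def execution_times(i_a: float, n_nodes: int, r_p: float, d_s: float, t_pcost: float):
    """Return (T_s, T_dsm) in microseconds.

    i_a: instructions in the parallel part; r_p: processor speed in MIPS
    (instructions per microsecond); d_s: shared data accesses per instruction;
    t_pcost: average protocol cost per shared access, assumed in microseconds.
    """
    t_s = i_a / r_p
    t_dsm = (i_a / n_nodes) * (1.0 / r_p + d_s * t_pcost)
    return t_s, t_dsm


def speedup(n_nodes: int, r_p: float, d_s: float, t_pcost: float) -> float:
    """Speedup of the parallel part: N / (1 + R_p * d_s * T_pcost)."""
    return n_nodes / (1.0 + r_p * d_s * t_pcost)


# Illustrative values only: 8 nodes, 100 MIPS, 0.1 shared accesses per
# instruction, 0.5 microseconds average protocol cost per shared access.
t_s, t_dsm = execution_times(1e6, 8, 100.0, 0.1, 0.5)
print(t_s / t_dsm, speedup(8, 100.0, 0.1, 0.5))
```

The product $R_p d_s T_{pcost}$ in the denominator explains why a faster processor lowers $S_p$ even though it shortens the absolute execution time.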


4.2. Network and System Model 

In this analysis, a high-speed, low-latency ATM network is assumed to be the underlying local computer
network. The queuing time on the network is assumed to be small enough to be neglected. (A future study is
examining the effects of queuing delays.) The memory access unit is assumed to be one word (or four bytes).
Each paragraph has G words. An application is executed by N nodes.

A typical ATM network consists of a set of nodes connected via a mesh of switches. In an ATM network,
data is segmented into small fixed-length cells, routed, then reassembled at the destination using header
information contained in the cells. Due to the efficient structure of ATM frames, the waiting time for accessing the
network can be designed to be very short. In this model, each network message with length $L_{msg}$ takes
$t_{fix} + n_{cell} t_{var}$ processing time at the transmitting and the receiving nodes, $n_{cell} L_{cell} / R_n$ transmission time,
and $n_{cell} t_{net}$ processing time through an ATM switch; where $n_{cell}$ is the number of cells needed to transmit the
whole message, or the ceiling of $L_{msg} / (L_{cell} - L_{hd})$; $L_{cell}$ and $L_{hd}$ are cell size and header lengths,
respectively; $t_{fix}$ and $t_{var}$ are fixed and variable parts of processing delays in the communicating nodes,
respectively; $R_n$ is the network data rate; $t_{net}$ is the average network switch latency a cell goes through in a
typical ATM network. Note that the processing time at the nodes includes the time for copying data between
host memory and network buffer, network processor latency, interrupt handling on reception of frames, and
segmentation/reassembly times.

The protocol cost is analyzed based on the time it takes for handling different kinds of paragraph validation 
faults. This analysis includes all but home-info faults, since they only occur when a page is accessed by a node 
for the first time. The fault handling time is expressed in terms of the total time for handling network messages, 
including required interrupt handling delays at the local and remote nodes. 

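The whole message for either a fault request or an invalidation request can fit into a single ATM cell. The
messages for data reply will have the size of a paragraph, which may need one, two or more ATM cells depending
on the size of the paragraph. The costs for these two different sizes of network messages, denoted by
request messages, msg-r, and data messages, msg-d, are

$$t_{msg-r} = \frac{L_{cell}}{R_n} + t_{fix} + t_{var} + t_{net} \tag{3}$$

$$t_{msg-d} = \left\lceil \frac{L_{msg}}{L_{cell} - L_{hd}} \right\rceil \frac{L_{cell}}{R_n} + t_{fix} + \left\lceil \frac{L_{msg}}{L_{cell} - L_{hd}} \right\rceil \left( t_{var} + t_{net} \right) \tag{4}$$

where $L_{msg}$ in Equation (4) is the length of a data reply, that is, one paragraph.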
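
The message cost model can be tried out numerically with the Python sketch below. It follows the per-message description in the text (a fixed node delay, plus per-cell node processing, transmission, and switch latency); treating the switch latency as a per-cell cost is this sketch's reading of the model. The sample figures (a 53-byte cell with a 5-byte header, lengths in bits, times in microseconds, rate in bits per microsecond) and the function names are assumptions for illustration.

```python
def n_cells(l_msg: int, l_cell: int, l_hd: int) -> int:
    """Number of cells for a message: ceiling of L_msg / (L_cell - L_hd)."""
    payload = l_cell - l_hd
    return (l_msg + payload - 1) // payload   # integer ceiling division


def message_time(l_msg: int, l_cell: int, l_hd: int, r_n: float,
                 t_fix: float, t_var: float, t_net: float) -> float:
    """Time for one message: node processing + transmission + switch latency."""
    n = n_cells(l_msg, l_cell, l_hd)
    node_processing = t_fix + n * t_var
    transmission = n * l_cell / r_n
    switching = n * t_net                      # per-cell switch latency (assumed reading)
    return node_processing + transmission + switching


CELL_BITS, HEADER_BITS = 53 * 8, 5 * 8         # standard ATM cell and header sizes
request = message_time(CELL_BITS - HEADER_BITS, CELL_BITS, HEADER_BITS,
                       r_n=100.0, t_fix=10.0, t_var=2.0, t_net=1.0)
data_64B = message_time(64 * 8, CELL_BITS, HEADER_BITS,
                        r_n=100.0, t_fix=10.0, t_var=2.0, t_net=1.0)
print(request, data_64B)
```

With these sample values a 64-byte paragraph already needs two cells, which illustrates why the data reply cost grows in steps as the paragraph size increases.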


From the memory coherency protocol, one can count the number of network messages involved in each
kind of fault. This message count also depends on the home and owner node relationship, as well as the number
of nodes within the copyset (the list of nodes with read-only copies of a paragraph), when a fault occurs. After
examining the protocol, one concludes that the message costs are as follows: $t_{msg-r} + t_{msg-d}$ for case e1 and
case nrd, $2 t_{msg-r} + t_{msg-d}$ for case e2, $(2 N_{set} + 1) t_{msg-r} + t_{msg-d}$ for case nwd, and $2 N_{set} t_{msg-r}$ for case
nwa. Here, $N_{set}$ denotes the number of nodes within the copyset when a fault occurs. Cases e1 and e2 represent
the situation when a fault occurs while the copyset on the home node is empty. The former is the case when
the owner is the home, or when the requesting node is the home node. The latter is the case when the owner is
not the home and the requesting node is not the home node. Cases nrd, nwd, and nwa represent the situations
for a read fault, a write-data fault, and a write-access fault occurrence, respectively, when the copyset on the
home node is not empty.
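
The per-case message costs can be collected in one small Python function, which makes it easy to see how the invalidation traffic scales with the copyset size. The function name case_cost and the sample timings are illustrative assumptions; the case labels (e1, e2, nrd, nwd, nwa) are those used in the text.

```python
def case_cost(case: str, t_msg_r: float, t_msg_d: float, n_set: int = 0) -> float:
    """Network time to resolve one fault, by case label from the text."""
    if case in ("e1", "nrd"):
        return t_msg_r + t_msg_d                  # one request, one data reply
    if case == "e2":
        return 2 * t_msg_r + t_msg_d              # request is forwarded once to the owner
    if case == "nwd":
        return (2 * n_set + 1) * t_msg_r + t_msg_d  # request + invalidations + confirmations
    if case == "nwa":
        return 2 * n_set * t_msg_r                # no data reply is needed
    raise ValueError(f"unknown case: {case}")


# Illustrative timings in microseconds.
for case in ("e1", "e2", "nrd", "nwd", "nwa"):
    print(case, case_cost(case, t_msg_r=10.0, t_msg_d=30.0, n_set=3))
```

Only the two write cases depend on $N_{set}$, since a write is the only operation that must invalidate the read-only copies.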

The average time spent for handling a paragraph fault depends on the probability of each of the above
cases as well as the probability of the number of nodes within the copyset, when a fault occurs. These probabilities
are estimated by simple probability models in this work. When a fault occurs, each node has equal probabilities
of 1/N for having accessed and of (1 - 1/N) for not having accessed this paragraph since the last time
the copyset was empty. Hence, the probability that the copyset is empty, when a fault occurs, is the probability
that either no node or exactly one node has accessed this paragraph. The probability that the number of nodes within
the copyset is i, when a fault occurs, denoted by $P\{N_{set} = i\}$, is the probability that exactly i+1 nodes have accessed
the paragraph. Therefore, we have

$$P\{N_{set} = 0\} = \left(1 - \frac{1}{N}\right)^{N} + \binom{N}{1}\left(\frac{1}{N}\right)^{1}\left(1 - \frac{1}{N}\right)^{N-1} = \left(2 - \frac{1}{N}\right)\left(1 - \frac{1}{N}\right)^{N-1} \tag{5}$$

$$P\{N_{set} = i\} = \binom{N}{i+1}\left(\frac{1}{N}\right)^{i+1}\left(1 - \frac{1}{N}\right)^{N-i-1} \quad \text{for } i = 1, 2, 3, \ldots, N-1 \tag{6}$$
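
As a quick check on this binomial model, the Python sketch below evaluates the copyset-size distribution and confirms that the probabilities for i = 0 through N-1 sum to one. The function name p_copyset_size is an assumption for illustration.

```python
from math import comb


def p_copyset_size(i: int, n: int) -> float:
    """Probability that the copyset holds i nodes when a fault occurs (N = n nodes)."""
    if not 0 <= i <= n - 1:
        raise ValueError("i must lie in 0..N-1")
    p = 1.0 / n
    if i == 0:
        # Either no node or exactly one node has accessed the paragraph.
        return (2.0 - p) * (1.0 - p) ** (n - 1)
    # Exactly i+1 of the N nodes have accessed the paragraph.
    return comb(n, i + 1) * p ** (i + 1) * (1.0 - p) ** (n - i - 1)


N = 8
dist = [p_copyset_size(i, N) for i in range(N)]
print(dist[0], sum(dist))   # the distribution sums to 1 up to rounding
```

For moderate N most of the probability mass sits at small copyset sizes, so the expensive multi-invalidation cases are weighted lightly in the average fault time.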


In the DICE DSM, it is expected that a paragraph is accessed by its home node most frequently. Let $x_s$
denote the probability that a paragraph is accessed by its home node. Other nodes are assumed to exhibit a
paragraph access probability that is uniformly distributed among all the non-home nodes with a total probability
of $1 - x_s$. Note that $x_s$ reflects the processor locality of parallel program behavior as described in [8].

The probability of each case is estimated by finding the conditional probabilities of each case, when either 
read or write access faults occur. The probability of case nrd fault is 1. The rest of the probabilities are 


P»a " 


(N- 2)(1 -x.) 


2 (N - 1)x, + (N - 2)(1 - x.) 


P« 1 1 P* 2 


( 7 ) 


Pmrt( w *t) (N - 1)X, + Njn -x,) + flf - 1 - N -( )(1 -x.) > 1 ^ 


where $p_{e1}$, $p_{e2}$, $p_{nwd}$, and $p_{nwa}$ are the probabilities of case e1, case e2, case nwd, and case nwa, respectively.
The average time spent for handling a paragraph read or write fault, denoted by $t_{rf}$ and $t_{wf}$, can be obtained
from Equations (3) through (8). After some simplification, we have

$$t_{rf} = \left(1 + P\{N_{set} = 0\}\, p_{e2}\right) t_{msg-r} + t_{msg-d} \tag{9}$$

$$t_{wf} = \left[P\{N_{set} = 0\}(1 + p_{e2}) + \sum_{i=1}^{N-1} P\{N_{set} = i\}\, p_{nwd}\right] t_{msg-r} + \left[P\{N_{set} = 0\} + \sum_{i=1}^{N-1} P\{N_{set} = i\}\, p_{nwd}\right] t_{msg-d} + \sum_{i=1}^{N-1} 2i\, P\{N_{set} = i\}\, t_{msg-r} \tag{10}$$

4.3. Application Behavior Model and Average Protocol Cost 

Torrellas et al. [19] proposed a model of sharing, which is classified into true sharing and false sharing.
Based on this sharing model, we divide the average protocol cost, $T_{pcost}$, into two parts: one part is caused by
true sharing misses, the other part is caused by false sharing misses. A miss is a true sharing miss when a
processor or node misses because the word was previously used by another node. A false sharing miss is
caused by multiple processors or nodes accessing different words within the same paragraph.

In this analysis, we first consider the application behavior independent of system architecture. The sharing
misses are based on an access unit (word), in the same way as in the work of Eggers and Katz [7], instead
of a coherency unit (paragraph). Then, we integrate it with the effects of using a paragraph size consisting of
multiple words.

True sharing misses vary significantly among different parallel applications, since they inherently depend
on the program behavior. True sharing misses are expected to increase as the number of processors or
nodes increases, since the frequencies and degrees of sharing increase. Hence, we use a simple linear relationship
to model this behavior. Let $f_r$ and $f_w$ denote the average rate of read faults per shared read data memory
access and the average rate of write faults per shared write data memory access, respectively. Then, we have

$$f_r = f_{r0} + f_{r1} N \quad \text{and} \quad f_w = f_{w0} + f_{w1} N \tag{11}$$

where $f_{r0}$ and $f_{w0}$ are the base points of $f_r$ and $f_w$, respectively; $f_{r1}$ and $f_{w1}$ are the incremental rates of
$f_r$ and $f_w$, respectively, when the number of nodes is changed. Note that $f_r$ and $f_w$ reflect the temporal
locality of parallel program behavior.

When paragraphs larger than a single word are taken into account, the true sharing misses are expected to
drop as the paragraph size increases. This is due to the spatial locality of parallel program behavior:
neighboring data are prefetched before being used. Note that we consider only the sharing misses
caused by the coherency protocol, and ignore those caused by insufficient physical memory.
We use the ratio of miss ratios, proposed by Smith in [16], to model this effect. Let $m_{r1}$ and
$m_{r2}$ denote the ratio of the miss ratio when the paragraph size is G to that when the paragraph size is one word, and
the ratio of the miss ratio when the paragraph size is G to that when the paragraph size is G/2, respectively. Then, we have




( 12 ) 
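
As a rough illustration of how a per-doubling ratio of miss ratios compounds, the sketch below assumes that the same ratio applies at every doubling of the paragraph size and that G is a power-of-two number of words. Under that assumption, the ratio for size G relative to one word is the per-doubling ratio raised to the power log2(G). The constant-ratio assumption, the function name, and the sample value 0.6 are ours for illustration and are not taken from Equation (12).

```python
def ratio_vs_one_word(per_doubling_ratio: float, g_words: int) -> float:
    """Ratio of miss ratios for a paragraph of g_words words versus one word.

    Assumption (hypothetical): every doubling of the paragraph size scales
    the miss ratio by the same factor, per_doubling_ratio.
    """
    if g_words < 1 or g_words & (g_words - 1):
        raise ValueError("g_words must be a positive power of two")
    doublings = g_words.bit_length() - 1  # exact log2 for powers of two
    return per_doubling_ratio ** doublings


if __name__ == "__main__":
    for g in (1, 2, 4, 8, 16):
        print(g, ratio_vs_one_word(0.6, g))
```

With a per-doubling ratio below 1, the compounded ratio shrinks geometrically, which matches the expectation that true sharing misses drop as the paragraph grows.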


Several research results [2,6,18] indicate that false sharing increases when either the number of
processors or the coherency unit size is increased. Hence, we also use a simple linear relationship to model this
behavior. Let $e_{fs}$ denote the probability of false sharing misses. Then, we have

$$e_{fs} = e_{fs0} + e_{fsx} N + e_{fsg} G \qquad (13)$$

where $e_{fs0}$ is the base point of $e_{fs}$, and $e_{fsx}$ and $e_{fsg}$ are the incremental rates of $e_{fs}$ when N and G are changed,
respectively.

Combining Equations (9) through (13), one can derive the average protocol cost as

$$T_{pcost} = m_{r1} \left[ (1 - w) f_r t_{rf} + w f_w t_{wf} \right] + e_{fs} \left[ (1 - w) t_{rf} + w t_{wf} \right] \qquad (14)$$

where w denotes the average rate of write operations per shared data memory access. In the above equation, the
two terms on the right side represent the protocol costs caused by true sharing misses and false sharing misses,
respectively.
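
The structure of the protocol cost can be made concrete with a short numerical sketch. It assumes one reading of Equation (14): a true sharing term weighted by the ratio of miss ratios and the read and write fault rates, plus a false sharing term weighted by the false sharing probability. The function and parameter names are hypothetical, and the fault handling times passed in are placeholder values, not results of Equations (9) and (10).

```python
def linear(base: float, rate: float, x: float) -> float:
    """Simple linear model: base point plus incremental rate times x."""
    return base + rate * x


def protocol_cost(m_r1: float, w: float, f_r: float, f_w: float,
                  e_fs: float, t_rf: float, t_wf: float) -> float:
    """Average protocol cost per shared access under the assumed reading.

    w is the fraction of shared accesses that are writes; t_rf and t_wf are
    the average read and write fault handling times (any consistent unit).
    """
    true_sharing = m_r1 * ((1 - w) * f_r * t_rf + w * f_w * t_wf)
    false_sharing = e_fs * ((1 - w) * t_rf + w * t_wf)
    return true_sharing + false_sharing


if __name__ == "__main__":
    n_nodes = 16
    f_r = linear(0.001, 0.001, n_nodes)   # illustrative values only
    f_w = linear(0.001, 0.001, n_nodes)
    # Placeholder fault handling times in microseconds (hypothetical).
    print(protocol_cost(m_r1=0.5, w=0.1, f_r=f_r, f_w=f_w,
                        e_fs=0.001, t_rf=300.0, t_wf=400.0))
```

Keeping the two terms separate makes it easy to see which kind of miss dominates for a given parameter set.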

4.4. Analytical Results 

This section shows the effects of changing the system structure and the application profile on the speedup, $S_p$. A
typical value is chosen for each parameter to reflect a typical system architecture and a target parallel application
profile. We analyzed the effects on $S_p$ by changing only one or two parameters at a time and fixing the other
parameters at their typical values.

For program behavior parameters, the typical degree of sharing and access pattern are chosen to be 0.1 for
both $d_s$ and w. The typical fault rates are chosen to be 0.001 for both $f_{r0}$ and $f_{w0}$, and 0.001 for both $f_{rx}$ and $f_{wx}$.
The typical locality factors are chosen to be 0.6 and 0.5 for $m_{r2}$ and $x_s$, respectively. Typical false sharing factors are chosen
to be 0.0001 for $e_{fs0}$ and 0.00001 for both $e_{fsx}$ and $e_{fsg}$. These typical values are intended to represent
applications suitable for network-based DSM and to reflect the significant effects of locality as well as false sharing.

For system parameters, the lengths of an ATM cell and its header are fixed at 53 and 5 bytes, respectively.
Other parameters are varied to reflect changes in system technology and architecture. The typical system
is chosen to have 16 nodes and 100 MIPS. The typical network data rate is chosen to be 150 Mbps. The typical
values of the two ATM processing time parameters, the second of which is $t_{var}$, are chosen to be 10 and 20
microseconds, respectively. These values are derived from the actual measurements of an ATM host-network
interface in [4]. Finally, $t_{net}$ is chosen to be 10 microseconds, which corresponds to the store-and-forward delay
of a single switch in an ATM LAN.

Figures 4 through 13 show the expected behavior of $S_p$ when the size of a paragraph, G, is changed.
$S_p$ increases as the paragraph size G increases up to a certain point. Beyond that
point, $S_p$ starts decreasing as G increases further. $S_p$ peaks when the paragraph size G is between
32 and 256 bytes, which is smaller than the page size used in a common operating system. This behavior
demonstrates the advantage of using a two-tier paging scheme. Note that the small fixed cell size (53 bytes)
used in ATM networks leads to the abnormal dent at a granularity of 64 bytes shown in Figures 5 through 9 and
11.
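
One way to see why a fixed cell size can produce a dent near 64 bytes is to count cells. The sketch below uses the 53-byte cell and 5-byte header stated above, which leave 48 payload bytes per cell, and counts how many cells a paragraph of a given size occupies. It deliberately ignores any message header or adaptation-layer overhead, which the analytical model may include, so the function names and the resulting numbers are only an illustrative assumption.

```python
CELL_BYTES = 53
HEADER_BYTES = 5
PAYLOAD_BYTES = CELL_BYTES - HEADER_BYTES  # 48 bytes of data per cell


def cells_needed(data_bytes: int) -> int:
    """Number of ATM cells needed to carry data_bytes of payload."""
    return -(-data_bytes // PAYLOAD_BYTES)  # integer ceiling division


def wire_time_us(data_bytes: int, rate_mbps: float = 150.0) -> float:
    """Transmission time in microseconds, counting whole cells on the wire."""
    bits = cells_needed(data_bytes) * CELL_BYTES * 8
    return bits / rate_mbps  # bits divided by Mbit/s gives microseconds


if __name__ == "__main__":
    for g in (16, 32, 64, 128, 256):
        print(g, cells_needed(g), round(wire_time_us(g), 2))
```

A 32-byte paragraph fits in one cell, while a 64-byte paragraph needs two cells and leaves the second one mostly empty, so the cost per useful byte jumps at that granularity.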

Figures 4 and 5 show that $S_p$ decreases as the average rate of shared data memory accesses per instruction,
$d_s$, and the average rate of write operations per shared data memory access, w, increase, respectively. Figures 6
and 7 show that $S_p$ decreases as the fault rate parameters (i.e., $f_{r0}$, $f_{w0}$, $f_{rx}$, and $f_{wx}$) increase. Figure 8 shows
that $S_p$ decreases as the ratio of miss ratios $m_{r2}$ increases. Figure 9 shows that $S_p$ decreases as the false sharing
parameters (i.e., $e_{fs0}$, $e_{fsx}$, and $e_{fsg}$) increase. Figure 10 shows that $S_p$ increases as $x_s$, the probability of a
paragraph being accessed by its home node, increases.

Figure 11 shows that $S_p$ decreases as the processor speed $R_p$ increases, because the benefits of parallel processing
diminish as the ratio of network overhead to execution time on each node grows. Note that the total
execution time for an application will still drop as $R_p$ increases, although $S_p$ decreases. This confirms the
expected and important fact that, as processor speeds increase, network overhead must be reduced in order to
achieve the same high level of speedup.

Figure 12 shows that $S_p$ increases as the network data rate $R_n$ increases, and that the margin of gain in
$S_p$ becomes smaller and smaller as $R_n$ increases. Figure 13 shows that $S_p$ decreases as
the ATM processing and switching times (i.e., the ATM processing time parameters, including $t_{var}$, and $t_{net}$) increase.

Figures 14 and 15 show the relationship of $S_p$ and $S_p/N$, respectively, with the number of nodes for different
paragraph sizes. $S_p$ increases as N increases, and the margin of gain in $S_p$ becomes smaller when
N is large.

5. Conclusions

In this paper, we present the design of a two-tier paging system for distributed shared memory,
where a paragraph, a much smaller memory unit than a page, is employed as the unit of coherency. The
system is modeled, and the analysis demonstrates the benefits of multiple-granularity memory management.
The problem of false sharing is alleviated, especially for systems with large page sizes and large objects.
The network latency for coherence maintenance is significantly reduced, since only a small amount of data has
to be transferred across the network for each remote memory access fault. Furthermore, the overhead of the
coherency protocol processing is reduced by introducing hardware support.

The proposed two-tier paging scheme is different from the two-level paging method used in a uniprocessor
system. The latter involves two levels of address translation. In our two-tier paging design, the page is the
only address translation unit and the paragraph is the validation unit. There is no address translation for
paragraphs.

The concept of using a second-tier page, namely a paragraph, is different from that of using a cache line.
The size of a paragraph is normally larger than a cache line. Although the paragraph coherency protocol and
algorithm are similar to those used in cache-based DSM multiprocessor systems, the design and implementation
considerations are quite different. In a network-based distributed shared memory system, communication
latency is significantly higher than that seen in a multiprocessor distributed shared memory system such as
DASH [12]. Network-based DSMs are implemented in software with hardware support, while multiprocessor-based
DSMs are implemented in hardware. Therefore, the allocation of and access to the coherency directories
are quite different.

The use of paragraphs, as opposed to a small page size, reduces the complexity of shared memory
management. If a small page size is used, very large page map tables will be required. By preserving the large
page size and using paragraphs only for shared pages, the page map tables stay small, and additional paragraph
map tables are needed for shared pages only. In addition to using the home node scheme, we have distributed
the management of paragraphs to the home nodes of the pages only. Hence, the root node acts as the clearing
house for all application pages, and the home nodes act as the clearing houses for the paragraphs in their
respective pages.
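
The table-size argument can be checked with simple arithmetic. The sketch below compares the number of map entries needed when the whole address space uses a small page size against the two-tier layout, where large pages cover the whole space and paragraph entries exist only for shared pages. All sizes in the example (a 64 MiB space, 4 KiB pages, 128-byte paragraphs, 256 shared pages) are hypothetical values chosen for illustration, as is the function name.

```python
def map_table_entries(space_bytes: int, page_bytes: int,
                      para_bytes: int, shared_pages: int) -> dict:
    """Count map entries for a small-page layout and a two-tier layout."""
    # Small pages everywhere: one entry per paragraph-sized page.
    small_pages_only = space_bytes // para_bytes
    # Two-tier: one entry per large page, plus paragraph entries
    # for the shared pages only.
    page_entries = space_bytes // page_bytes
    paragraph_entries = shared_pages * (page_bytes // para_bytes)
    return {
        "small_pages_only": small_pages_only,
        "two_tier": page_entries + paragraph_entries,
    }


if __name__ == "__main__":
    print(map_table_entries(64 * 2**20, 4096, 128, 256))
```

For these illustrative numbers the two-tier layout needs 24576 entries instead of 524288, and the gap widens as the shared fraction of the address space shrinks.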

A trace-driven simulation model that takes network queuing delays into consideration is under development.
This simulation model will be used to validate the analytical model described in section 4. The simulation
model is built with BONeS DESIGNER [5], and the traces are generated by Tango Lite [10] when running
the parallel applications of Stanford SPLASH [16].

The current DICE DSM design is based on a strict consistency model and a write-invalidate coherency
protocol. Extensions using multiple consistency and coherency protocols are under consideration. In a future
version of DICE, we plan to incorporate support for a relaxed consistency model to hide the large latency of
remote memory accesses by allowing buffering and merging.

Abstract - Often, the computing power of networks of
workstations is left unused. The objective of this project is to
develop a set of tools to take advantage of this potential
computing power and to create a platform suitable for large
scientific computations. This paper presents the architecture of a
Distributed Integrated Computing Environment (DICE)
consisting of a cluster of networked workstations. DICE
consists of three interacting subsystems. DSM (distributed
shared memory) provides the underlying communication
and computing paradigm for threads of a parallel task to
execute on a cluster of cooperating workstations. DRS
(distributed run-time subsystem) provides users with programming
tools to develop and execute DICE multi-threaded
applications. PS (parallel scheduler) is a self-optimizing,
application-specific scheduler and is responsible for thread
scheduling and synchronization.

1. Introduction 

The majority of scientific applications require a fairly
large amount of memory to execute a task. If a task is to
be partitioned into threads (sub-tasks) that are executed
in parallel, memory sharing is very desirable, since it
allows variables to be shared among threads within the same
task. Also, software based on shared memory is more
portable and machine independent than software based on
distributed memory, which is architecture dependent.
For these reasons, the shared memory multiprocessor
system has become increasingly popular for executing
large scientific applications.

On the other hand, there is a tremendous amount of
computing power that is left unused in networks of
workstations. Very often a workstation is simply sitting idle on
a desk. A set of tools can be developed to take advantage
of this potential computing power to create a platform
suitable for large scientific computations. The integration
of several workstations into a logical cluster of
distributed, cooperative computing stations presents an
alternative solution to shared memory multiprocessor
systems.

*This work was supported by NASA-Ames Research Center grant
number NCC 2-644, entitled 'Parallel Processing for Scientific
Computations'.

DICE (Distributed Integrated Computing Environment)
is designed to meet these objectives. DICE employs
virtual memory supported distributed shared
memory (DSM) as its underlying computing and
communication paradigm. It integrates DSM with a parallel
scheduling subsystem and a parallel programming subsystem.
In Figure 1, a distributed task T1 is running on four
workstations, while a distributed task T2 is running on three
workstations. These distributed tasks are independent
of each other, and a workstation may have threads of two
or more tasks running on it concurrently.

This paper presents the DICE architecture in the
following sections. Section 2 identifies the related work in
this area. Section 3 describes the system architecture of
DICE. It consists of three subsystems, which are
described in sections 4 to 6, respectively. The interaction
among these subsystems is delineated in section 7. The
expected system performance is shown in section 8.
Finally, section 9 gives a summary of the results
accomplished with this work.

2. Related Work

There are several systems designed to utilize the
processor power of idle workstations. These systems include
Sprite [24], V system [33], NEST [1], Butler [23], Condor
[20], REM [30], Stealth [17], and Sidle [16]. These
systems provide remote execution or process migration
facilities. In addition to these features, DICE provides the
distributed shared memory (DSM) paradigm while using
these idle workstations.

A number of DSM systems over LANs have been
developed recently [31]. Among them, Ivy [18,19] is
implemented on a network of Apollo workstations. The
memory is paged, and copies of pages may be replicated
in different hosts. Strict coherency semantics are used,
and the memory coherency is maintained by a
write-invalidate protocol with dynamic ownership. The owner of
a page is located via either a centralized manager, fixed
distributed managers, or an individual host which
forwards the request. Ivy is used for applications employing
a multi-threaded task. All threads share the same
virtual address space. False sharing may occur in this system,
since its consistency or access unit (e.g., a word) is smaller than
the sharing unit (a page). In addition, the single-write
nature of its protocol may cause "ping-pong" behavior
between multiple writers of a shared page, also known as the
thrashing problem.

To overcome false sharing and thrashing, some
systems employ special schemes. Mach [14] supports
DSM with a shared memory server. False sharing and
thrashing are handled by fault scheduling via a queuing
mechanism [13]. Clouds [27,2] is an object-oriented
distributed operating system where objects can migrate
across processors. False sharing and thrashing are
avoided, since Clouds uses single-writer, single-reader
strict coherence semantics.

Mirage [12] is a DSM system implemented in the
kernel of the Locus distributed system [34]. Thrashing is
reduced by using a time window scheme, in which the
system guarantees that the writer of a page retains access to
the page for a fixed period of time. Munin [6,7] is a DSM
system implemented on top of the V kernel [9],
which allows programmers to associate types with shared
data. Hence, multiple consistency protocols can be used.
A delayed write-update scheme is used for a read-mostly
protocol. Hence, thrashing can be reduced by using
different combinations of data types.

Mether [21,22] is a software DSM implemented on
SunOS 4.0. It allows a process to access memory as either
consistent or inconsistent, and allows only a subset of a page to
be transferred. It also provides both demand-driven and
data-driven semantics for updating pages. All of these
operations are encoded in a few address bits in the
virtual address. False sharing and thrashing are reduced
through the use of the incoherent memory.

DICE presents a novel approach to handling the
problem of false sharing and thrashing. The shared memory
is structured as a two-tier paging system. The first tier is
a page, which is the common page used in an operating
system. The second tier is a paragraph, which is a smaller
fixed-size block of information contained within a page.
The introduction of a small paragraph size improves
system performance, since it reduces the chance of false
sharing and the amount of data that needs to be transferred
over the network.

The distributed run-time subsystem, DRS, is another part of
DICE. A survey of object-oriented languages for
parallel environments is presented in [36]. Other
programming languages and systems developed for distributed
systems are presented in [4]; Amber [8] and Orca [5,32]
are two such systems.

The parallel scheduler, PS, is the third part of DICE.
Researchers have taken several approaches to the
problem of parallel scheduling. They range from
centralized control, where global knowledge of the system is
maintained in one place [25,26], to distributed control,
where all nodes have equal knowledge of the system.
The methods used vary from Bayesian decision theory [28] to
data flow graphs [10].

The parallel scheduler in DICE is an extension of our
prior work on MOPPS [3]. MOPPS is a self-tuning
parallel scheduler. It partitions the given application
into small tasks, schedules and coordinates these tasks
among network resources, and maintains a balanced load
between workstations without overburdening the
communication network.

3. System Structure 


DICE is an experimental system which aims at
providing a computing environment for the execution of
multi-threaded tasks. Figure 2 illustrates the system
structure of DICE. A parallel task may consist of
multiple threads that can be scheduled to run simultaneously
on a cluster of workstations. A thread is an active entity
that provides the notion of a computation. Threads on
separate workstations also share the same virtual
address space, and communicate with each other using
shared memory. Synchronization of threads to access
shared resources is done using functions provided by the
distributed run-time library.

4. Distributed Shared Memory 


The DICE DSM system consists of a cluster of
workstations connected by a high-speed, low-latency local
area network. In addition to a host processor and memory,
each node has a network processor and a Distributed
Shared Memory Management Unit (DSMMU). The DSMMU is
an extension of the traditional MMU that allows DSM to
handle shared memory efficiently. When data is not
available locally and needs to be fetched from a remote
host, the DSMMU triggers special access faults.
Otherwise, the DSMMU just performs the traditional TLB
operations. An example of the architecture of a single host
system is shown in Figure 3. Note that this example uses
dual-ported memory, which allows both the host processor
and the network processor to access the data structures for
managing shared memory.

Each page of DICE DSM is the same as the common
page in a typical operating system, such as SunOS.
Each page is further divided into several small
equal-sized paragraphs. Paragraphs are used as the unit of
coherency. Pages are used as the unit of sharing. Memory
is allocated in segments, each of which may contain one or
more pages. Figure 4 illustrates the hybrid nature of this
memory structure.

Coherency Protocol

DICE mainly provides the computing environment
for the execution of multi-threaded tasks. A parallel
task consists of multiple threads that are scheduled to
run simultaneously on a cluster of workstations. The
shared data of memory pages are also distributed and
replicated among these hosts. The DSM system
supports the sharing of those pages, and maintains the
coherency among replicated data copies. Each running
application has a root host, on which it was loaded and
executed. The root host maintains the state information
for all shared pages used in the application, while other
hosts maintain the state information for only the shared
pages that are used in their local systems.

DICE is a home-based virtual DSM system, in which
each shared page has a home host. A home host
maintains the state information for its pages, ensures that the
last copy of a page is not purged, and keeps track of all
copies of the paragraphs within its pages. Other hosts
keep only the information about the location of the
home host. A remote request for handling a memory
access fault is always sent to the home host of the target
page. When a host does not know the home host for a
certain page and tries to access it, a home-info fault will
be triggered and a home-info request will be sent to the
root host. If the home host is not yet assigned, the root
host will assign the first requesting host to be the home host
of that page, update its own database, and send back a
reply confirming this assignment. Otherwise, the root
host simply sends back a reply giving the information of
the home host for that page.
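
The root host's handling of a home lookup can be sketched as a small directory in which the first host to ask about an unassigned page becomes its home. This is a minimal single-process illustration; the class name, method name, and host identifiers are hypothetical, and the real exchange happens through request and reply messages between hosts.

```python
class RootDirectory:
    """Root-host table mapping each shared page to its home host."""

    def __init__(self) -> None:
        self.home_of = {}  # page number -> home host identifier

    def handle_home_info(self, page: int, requester: str) -> str:
        """Return the home host of a page, assigning one if needed.

        If no home is assigned yet, the first requesting host becomes
        the home host and the directory is updated.
        """
        if page not in self.home_of:
            self.home_of[page] = requester
        return self.home_of[page]


if __name__ == "__main__":
    root = RootDirectory()
    print(root.handle_home_info(7, "hostA"))  # hostA becomes home of page 7
    print(root.handle_home_info(7, "hostB"))  # hostB is told the home is hostA
```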

The memory coherency of DICE DSM is maintained
at the paragraph level. Each paragraph has an owner host,
which has the ownership of this paragraph. An owner
host always has an up-to-date copy of its paragraph, and
is the only host that is permitted to write to the paragraph.
The ownership of a paragraph may be transferred from
one host to another according to the coherency protocol.
Information about the current owner of a paragraph is
maintained at the home host of the page containing the
paragraph.

A paragraph can be read by multiple hosts simultaneously,
but it can only be written by its current owner host.
The access right of a particular host to a paragraph may
be either read-write, read-only, or none. A host can
acquire or upgrade its access rights by sending requests to
the home host of the page in which the desired paragraph
resides.

A host can immediately perform read and write
operations on a paragraph if it has read-write access for that
paragraph, or perform read operations on a paragraph if
it has read-only access for that paragraph. When a read
operation is issued to a paragraph with none rights, a
read-data fault will be triggered and a read-data request
will be sent to its home host. When a write operation is
issued to a paragraph with none rights, a write-data fault
will be triggered and a write-data request will be sent to
its home host. When a write operation is issued to a
paragraph with read-only access, a write-access fault will be
triggered and a write-access request will be sent to its
home host.
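
The mapping from an operation and the current access right to the fault that is raised can be written as a small decision function. This is a sketch of the rules stated above; the enum, the function name, and the use of plain strings for operations and fault names are illustrative choices, not part of DICE.

```python
from enum import Enum
from typing import Optional


class Right(Enum):
    NONE = "none"
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"


def fault_for(op: str, right: Right) -> Optional[str]:
    """Return the fault triggered by op on a paragraph, or None if allowed."""
    if op == "read":
        # Reads proceed with read-only or read-write access.
        return "read-data" if right is Right.NONE else None
    if op == "write":
        if right is Right.NONE:
            return "write-data"    # needs both the data and write permission
        if right is Right.READ_ONLY:
            return "write-access"  # has the data, needs write permission
        return None
    raise ValueError("op must be 'read' or 'write'")


if __name__ == "__main__":
    for op in ("read", "write"):
        for right in Right:
            print(op, right.value, fault_for(op, right))
```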

When a page is initialized, the home host is the default
owner host for all paragraphs within this page. Any other
host will send a remote request to the home host when it
tries to access any paragraph of this page. If a read-data
request is received and the home host is itself the owner
host of the desired paragraph, the home host will return a
reply containing the most recent copy of that paragraph.
The access rights of both the home and requesting hosts are
changed to read-only. If it is not the owner host, the
home host will forward the read-data request to the
owner host of that paragraph. The latter changes its
access right to read-only, and then sends directly to the
requesting host a reply which contains the most recent copy
of that paragraph. After it receives the reply, the
requesting host changes its access right to read-only and
sends to the home host an acknowledgement together with
the received reply. Having received this acknowledgement
with the reply, the home host also changes its access right to
read-only and becomes the owner host of that paragraph.
If the home host is itself the requesting host, it will send
the read-data request directly to the owner host. The latter
changes its access right to read-only, and then sends back
a reply which contains the most recent copy of that
paragraph. Having received this reply, the home host
changes its access right to read-only and becomes the
owner host of that paragraph.

If a write-data request is received, the home host will
return a reply containing the most recent copy of
the desired paragraph when it is itself the owner host and
no other host has a valid copy of that paragraph. If
multiple valid copies exist, the home host will send invalidate
requests to all hosts on which those copies are located,
and wait for confirmations from all of them before
returning the reply. Upon receiving the invalidate
request, each host changes its access right to that
paragraph to none and returns its confirmation to the home
host. The access right of the home host is then changed
to none, while the requesting host becomes the owner
host and its access right is changed to read-write. If it
is not the owner host, the home host will forward the
write-data request to the owner host of that paragraph. The
latter changes its access right to none, and then sends
directly to the requesting host a reply which contains the
most recent copy of that paragraph. After it receives the
reply, the requesting host changes its access right to
read-write and sends an acknowledgement to the home
host. Having received this acknowledgement, the home
host updates its database to indicate that the requesting
host has become the owner host of that paragraph. If
the home host is itself the requesting host, it will send
the write-data request directly to the owner host. The latter
changes its access right to none, and then sends back a
reply which contains the most recent copy of that
paragraph. Having received this reply, the home host
changes its access right to read-write and becomes the
owner host of that paragraph.

If a write-access request is received, the home host will
return the write-access confirmation when no other
host has a valid copy of that paragraph. If three or more
valid copies exist, the home host will send invalidate
requests to all hosts (except itself and the requesting host)
on which those copies are located and wait for
confirmations from all of them before returning the
write-access confirmation. Upon receiving the invalidate request,
each host changes its access right to that paragraph to
none and returns its confirmation to the home host. The
access right of the home host is then changed to none,
while the requesting host becomes the owner host and its
access right is changed to read-write. If the home host is
itself the requesting host, it will send the invalidate
requests directly to all other hosts which have valid copies
of that paragraph and wait for confirmations from all of
them. Upon receiving the invalidate request, each host
changes its access right to none and returns its
confirmation to the home host. The home host then changes its
access right to read-write and becomes the owner host of
that paragraph.
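
The three request types above can be traced with a small single-process model of the bookkeeping for one paragraph. This is a sketch under stated assumptions, not the DICE implementation: the message exchanges (forwarding, replies, acknowledgements, invalidations, and confirmations) are collapsed into direct state updates, the home host is assumed to start with read-write access, and the class and method names are hypothetical.

```python
class ParagraphEntry:
    """Access rights and ownership for one paragraph, seen from its home host."""

    def __init__(self, home: str, hosts: list) -> None:
        self.home = home
        self.owner = home  # the home host is the default owner
        self.rights = {h: "none" for h in hosts}
        self.rights[home] = "read-write"  # assumed initial right

    def read_data(self, requester: str) -> None:
        """Serve a read-data request: owner, requester, and home end up
        read-only, and the home host becomes (or stays) the owner."""
        self.rights[self.owner] = "read-only"
        self.rights[requester] = "read-only"
        self.rights[self.home] = "read-only"
        self.owner = self.home

    def _grant_write(self, requester: str) -> list:
        # Invalidate every other valid copy, then hand over ownership.
        invalidated = [h for h, r in self.rights.items()
                       if h != requester and r != "none"]
        for h in invalidated:
            self.rights[h] = "none"
        self.rights[requester] = "read-write"
        self.owner = requester
        return invalidated

    def write_data(self, requester: str) -> list:
        """Serve a write-data request (requester holds no copy)."""
        return self._grant_write(requester)

    def write_access(self, requester: str) -> list:
        """Serve a write-access request (requester holds a read-only copy)."""
        if self.rights[requester] != "read-only":
            raise ValueError("write-access requires a read-only copy")
        return self._grant_write(requester)


if __name__ == "__main__":
    entry = ParagraphEntry("H", ["H", "A", "B"])
    entry.read_data("A")
    print(entry.owner, entry.rights)
    print(entry.write_access("A"), entry.owner, entry.rights)
```

The returned list names the hosts whose copies were invalidated, including the home host's own copy when it held one, which corresponds to the invalidate requests and the home host's own downgrade in the protocol description.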

5. Distributed Run-time Subsystem

DICE DRS transforms the DICE DSM from a flat
space into an object-oriented structured space. DRS
consists of a set of tools that implement DICE. The
Application Programmer's Interface (API) provides users with
programming tools to develop and execute DICE
multi-threaded applications. The tools used during program
development include a parallel language and its
compiler, library interface functions, a linker, and other system
services.

A new Object-Oriented Dataflow Language (OODL) is
being designed as the parallel language used in DICE.
Important features of object-oriented programming
are information hiding and encapsulation
[11,29]. They provide a higher level of data abstraction in
modeling real world objects. Such constructs are helpful
in designing parallel programs [35]. In general, parallel
programs are difficult to design because the programmer
must consider multiple execution threads instead of a
single thread. All possible interactions among the
threads must be considered. Also, parallel programs are
hard to maintain because a simple change may affect the
interaction pattern and result in global consequences.
Information hiding helps in reducing the possible
interactions that need to be considered, while data
encapsulation helps in minimizing the maintenance effort when
program changes are needed.

While the object-oriented model provides a high level 
of programming abstraction, it does not naturally exploit 
parallelism of applications constructed with objects. A 
dataflow model can expose and exploit the maximum 
amount of parallelism, as well as express data depen- 
dence from different levels of abstraction in a very natu- 
ral way. The combination of the object oriented and da- 
taflow concepts makes it easier for programmers to 
design large scale multi-threaded parallel programs, and 
to build re-usable concurrent software modules. 

The OODL language in DICE is an extension of
C++. Dataflow constructs are added to allow
programmers to express parallelism explicitly. The parallel
compiler can be realized using a preprocessor to translate the
extended source code into C++ programs, which in
turn are compiled into object code using an existing
C++ compiler.

The run-time library interface functions provide a
collection of library routines that are linked with each
parallel program. They are invoked to support the service
requests made by system processes at run-time. The
OODL compiler will use these functions to realize the
parallelism expressed in the application programs.
These functions can also be used by the application
directly.

The linker will create a standard execution file such as 
a.out and an execution dependency tree called a.tree. 
The information kept in the dependency tree includes 
the names of the parallel threads; information about the 
resources of the threads, such as starting address and 
memory requirements; and the predecessors and succes- 
sors of each thread. This information will be used by the 
parallel scheduler to create and allocate shared memory 
segments, and to schedule threads on different worksta- 
tions at run-time. The linker will arrange shared vari- 
ables into shared segments, to simplify the management 
of shared memory by the DSM subsystem. 
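
The kind of information kept in the dependency tree can be pictured with a small in-memory structure. The sketch below is hypothetical: the on-disk format of a.tree is not described here, so the field names, the example threads, and the helper that lists runnable threads are assumptions made only to illustrate how a scheduler could use predecessor and successor information.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ThreadNode:
    """One parallel thread as recorded in a (hypothetical) dependency tree."""
    name: str
    start_address: int
    memory_bytes: int
    predecessors: List[str] = field(default_factory=list)
    successors: List[str] = field(default_factory=list)


def ready_threads(tree: Dict[str, ThreadNode], finished: Set[str]) -> List[str]:
    """Threads that have not run yet and whose predecessors have all finished."""
    return sorted(
        node.name for node in tree.values()
        if node.name not in finished
        and all(p in finished for p in node.predecessors)
    )


tree = {
    "init": ThreadNode("init", 0x1000, 4096, [], ["left", "right"]),
    "left": ThreadNode("left", 0x2000, 8192, ["init"], ["join"]),
    "right": ThreadNode("right", 0x3000, 8192, ["init"], ["join"]),
    "join": ThreadNode("join", 0x4000, 4096, ["left", "right"], []),
}

if __name__ == "__main__":
    print(ready_threads(tree, set()))
    print(ready_threads(tree, {"init"}))
```

In this example only "init" is runnable at the start, and "left" and "right" become runnable together once it finishes, which is the point at which a scheduler could place them on different workstations.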

DRS also provides services for executing applications
at load-time and run-time. These services include the
use of the DICE daemon(s), as well as the automatic
creation of a root process and alias remote processes for a
parallel task.

For each workstation that participates in DICE, a
daemon process has to be present. This daemon is
responsible for invoking DICE alias processes on remote
workstations. Each DICE application creates a root process
when it starts. The workstation where the root process is
running is referred to as the root workstation. A DICE
application may have zero or more alias processes. An
alias process is created by the root process on a remote
workstation through a DICE daemon as needed.

The root process is a multi-threaded process which
runs on the root workstation. It is created when the
parallel task is submitted to the system. In DICE, the thread
is the unit of execution, while a process is the unit of
resource allocation. Each process contains one or more
threads. The root process provides the virtual address
space and system resources for threads running on the root
workstation. The root thread is the first thread of a
parallel task. It is responsible for creating the parallel
scheduler and DSM manager threads before any
application threads start to run. It then becomes the first
application thread running on the root workstation. The root
process terminates when the parallel task is done.

An alias process is a reincarnation of the root process
on each remote workstation. An alias process is created
when a thread is scheduled to run on a remote workstation
for the first time. The alias process provides the
same virtual address space as the root process, together
with system resources, for threads running on its
workstation. These threads include an alias primary thread,
a DSM manager, and application threads. An alias primary
thread is responsible for creating its local DSM manager as
well as the first application thread running on its local
workstation. The alias primary thread then listens for
thread-create requests coming from the network and creates
the requested threads of its own parallel task
on its local workstation. The alias primary thread and
DSM manager of a remote workstation remain after
all of its application threads have terminated. The alias
primary thread waits for thread-creation requests from
the parallel scheduler, while the DSM manager waits for
memory access requests from other workstations. When
the root process is done, the parallel scheduler sends
a termination signal to all the alias processes of that
particular task. This ensures that all alias processes are
terminated before the termination of the root process.

The DICE daemon process is a server that is responsi- 
ble for invoking alias processes on a remote workstation. 
After invoking an alias process, the daemon process will 
have nothing to do with this application task. It will go 
back to listening to requests from the network. If a work- 
station does not want to participate in DICE, it can simply 
terminate this daemon process. A DSM manager is an 
active entity on each workstation responsible for handl- 
ing memory access faults. Each DSM manager maintains 
a memory mapping table that maps each memory page to 
its local workstation or other remote workstations. 
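
The memory mapping table kept by each DSM manager can be pictured with the following sketch. The class name MappingTable, its methods, and the host names are hypothetical; the sketch only shows the lookup that decides whether a faulting page is held locally or on a remote workstation, not the coherency protocol.

```python
class MappingTable:
    """Maps each memory page number to the workstation that currently holds it."""

    def __init__(self, local_host):
        self.local_host = local_host
        self.owner = {}                      # page number -> host name

    def set_owner(self, page, host):
        self.owner[page] = host

    def resolve_fault(self, page):
        """Return ('local', host) or ('remote', host) for a faulting page."""
        host = self.owner[page]
        kind = "local" if host == self.local_host else "remote"
        return kind, host


table = MappingTable(local_host="ws1")
table.set_owner(0, "ws1")   # page 0 is held on this workstation
table.set_owner(1, "ws2")   # page 1 is held on a remote workstation
```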

6. Parallel Scheduler 

DICE PS is a self-optimizing application-specific
scheduler. It is responsible for thread scheduling and 
synchronization. PS is implemented as a thread within 
the parallel task. Each parallel task has one PS running 
on the workstation where the task initially starts to run. 
This special thread is created during the task load-time. 

When an application needs to create another thread 
or to terminate itself by joining with other threads, it 
passes control of execution to the PS. The PS will find
the fastest way to run the application by using the
information in a Task Execution Dependence Tree, which is
created as an auxiliary file during the compilation of the 
source program. 

The PS decides whether the local workstation has 
enough resources to run the different threads, which 
threads to send to remote workstations to run, and which 
remote workstations to send them to. It uses several 
tools to make intelligent decisions at run time. Those
tools are: a CPU load estimator, a network load estimator,
an intelligent database, and a bidding process.

The CPU load estimator runs on every workstation on 
the network and keeps track of the load on that worksta- 
tion. When the time comes to run a thread on the local 
CPU, PS looks at the CPU load estimator for information
about the load on the local CPU. Similarly, when a bid
arrives at a workstation, the decision whether to accept 
the bid or not depends partially on the readings taken by 
the CPU load estimator. 

The network load estimator monitors the traffic on the
network and gives PS an up-to-date reading of the network
traffic. Smaller partitions that take a relatively short
time to execute can become too expensive to ship if
transmission times become too long. In that case, it might
be better to keep them on the local workstation, defer
shipping them, or combine two or more into larger partitions.

The network load estimator has the responsibility to
provide PS with real-time network traffic information.
The network load estimator can be as simple as a bus
monitor which continually updates a register (interpreted as
an integer) signifying network utilization levels of high,
medium, or low.
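
A minimal sketch of such a register update is given below. The integer encoding of the three levels and the threshold values (0.3 and 0.7) are assumptions chosen for illustration; the paper does not specify them.

```python
LOW, MEDIUM, HIGH = 0, 1, 2     # hypothetical integer encoding of the register


def utilization_level(busy_fraction, medium_at=0.3, high_at=0.7):
    """Map a measured bus-busy fraction in [0, 1] to LOW, MEDIUM, or HIGH."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if busy_fraction >= high_at:
        return HIGH
    if busy_fraction >= medium_at:
        return MEDIUM
    return LOW
```

A coarse three-level reading keeps the query cheap for PS, at the cost of hiding smaller fluctuations in network traffic.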

A small and efficient database records thread
performance on each workstation under different CPU and
network load conditions. This database allows the
bidding process to generate a reasonable estimate of the
expected run time of a thread on a particular target
workstation.
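
One possible shape for such a database is sketched below. The class PerformanceDB, its key layout, and the use of a running average are assumptions made for illustration only; the paper does not describe the record format.

```python
class PerformanceDB:
    """Running-average run times keyed by thread, workstation, and load levels."""

    def __init__(self):
        self._stats = {}                     # key -> (mean_seconds, sample_count)

    def record(self, thread, host, cpu_level, net_level, seconds):
        """Fold one observed run time into the running average for this key."""
        key = (thread, host, cpu_level, net_level)
        mean, n = self._stats.get(key, (0.0, 0))
        self._stats[key] = ((mean * n + seconds) / (n + 1), n + 1)

    def estimate(self, thread, host, cpu_level, net_level, default=None):
        """Return the mean observed run time, or `default` if none is recorded."""
        entry = self._stats.get((thread, host, cpu_level, net_level))
        return entry[0] if entry else default


db = PerformanceDB()
db.record("t1", "ws2", "low", "low", 2.0)
db.record("t1", "ws2", "low", "low", 4.0)
```

After the two recorded runs above, the estimate for thread t1 on ws2 under low CPU and network load is the mean of the observations, 3.0 seconds.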

The intelligent database is designed to categorize
different higher level operations of modules and
parameterize their computational and communication time
requirements. The contents of the intelligent database are
tailored to the installation where it resides. The
database is initialized with the types of applications being
run, and its contents are updated as new applications are
introduced.

When PS decides that it is best to send some threads to 
a remote workstation to run, it needs a way to pick those 
workstations. Instead of forcing other, possibly heavily 
loaded, workstations to take some of the threads, PS asks 
for help through the bidding process. It simply asks for
help in running a given thread and tells the other
workstations about the memory and CPU requirements of the
thread. This information is found in the intelligent
database.

Upon each task completion, the intelligent database is 
updated to reflect the most current experience. When 
no data is available about an application, we can run it the 
first time with gross overestimates, or underestimates, 
and let the intelligent database learn about it. Simulation may
also be used to obtain initial estimates. 

It is essential that the intelligent database be queried
and updated quickly, since it would otherwise become a
system bottleneck and slow down the entire system.
Ultimately, the intelligent database can be implemented
in hardware as a content addressable memory.

In the bidding algorithm, PS weighs execution time
versus shipping and management time for each resident
module. If execution time is greater than shipping and
management time and the load on the local workstation
is higher than a predefined threshold, the parallel
scheduler broadcasts a global message through the network
asking for help. This "help wanted" message includes
enough information about the module to be sent to enable
other workstations to determine if they can offer
their help. The information includes the estimated
module execution time, memory and disk requirements, and
any other information that is useful in making the
decision.
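
The decision rule described above can be sketched as follows. The function names, the parameter names, and the dictionary layout of the message are hypothetical; only the comparison itself (execution time against shipping plus management time, and local load against a threshold) comes from the text.

```python
def should_ask_for_help(exec_time, ship_time, mgmt_time, local_load, load_threshold):
    """Return True when the module is worth shipping and the local host is busy."""
    worth_shipping = exec_time > ship_time + mgmt_time
    overloaded = local_load > load_threshold
    return worth_shipping and overloaded


def help_wanted_message(module, exec_time, memory_bytes, disk_bytes):
    """Build the broadcast payload; the dictionary layout is an assumption."""
    return {
        "type": "help_wanted",
        "module": module,
        "estimated_exec_time": exec_time,
        "memory_bytes": memory_bytes,
        "disk_bytes": disk_bytes,
    }
```

For example, a module with an estimated execution time of 10 seconds, a shipping time of 2 seconds, and a management time of 1 second would be offered to other workstations only if the local load also exceeds the threshold.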

Those workstations which can potentially bid to accept 
the module for processing will examine this workload in- 
formation and determine whether it is feasible to bid. If a 
workstation is capable of assisting, it will return a mes- 
sage stating its availability, and will commit to this bid for 
a period long enough for the asking workstation to
receive the return message and act on it. Through this
process, workstations that bid and are not accepted
will waste little time before considering later "help
wanted" messages.

Each workstation will monitor the network before
sending its reply to determine if any other workstation
has responded to the request, and will not send its reply
if one has. It is assumed that the first
workstation to reply will get the job, and there is no need
for others to do so. PS sends the module to the first
workstation that replies to the request.
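
The first-reply-wins rule can be illustrated with a small simulation. The function first_bidder and the representation of each workstation as a (can_help, reply_delay) pair are assumptions for this sketch; real reply suppression would depend on observing the network, which is abstracted away here.

```python
def first_bidder(candidates):
    """Pick the workstation whose reply would reach the network first.

    `candidates` maps workstation name -> (can_help, reply_delay_seconds).
    Workstations that cannot help never reply; the others stay silent once
    an earlier reply has been observed. Returns None if nobody can help.
    """
    willing = [
        (delay, name)
        for name, (can_help, delay) in candidates.items()
        if can_help
    ]
    if not willing:
        return None
    return min(willing)[1]
```

In this sketch, a workstation that could reply fastest but cannot help (for example, because it is heavily loaded) is simply skipped, and the module goes to the fastest willing workstation.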

PS repeats the help wanted messages for a given task 
until either it receives a response or the task is at the 
point where it has to be executed in order not to delay the 
rest of the tasks. 
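
The retry behavior can be sketched as a simple loop. The function request_help, the caller-supplied broadcast callback, and the fixed retry interval are hypothetical; the paper states only that the request is repeated until a response arrives or the task can no longer be delayed.

```python
def request_help(broadcast, must_start_by, now, retry_interval):
    """Repeat a help-wanted broadcast until a bid arrives or time runs out.

    `broadcast(t)` is a caller-supplied function returning the name of the
    first bidder at time t, or None. Returns (bidder, time); bidder is None
    when the task has to be executed locally.
    """
    t = now
    while t < must_start_by:
        bidder = broadcast(t)
        if bidder is not None:
            return bidder, t
        t += retry_interval
    return None, t
```

When the loop ends without a bidder, the caller runs the task on the local workstation so that the rest of the tasks are not delayed.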

As a thread is scheduled to run on a remote workstation,
its respective virtual address space segments are
allocated physical memory blocks on that workstation.
PS takes the available memory resources on a workstation
into consideration when scheduling a thread there.

7. Interactions and Integration 

DSM, DRS, and PS are three separate subsystems of
DICE. They interact with each other to provide an inte- 
grated environment and to cooperatively work to provide 
the distributed computing paradigm for a parallel task. 

After a parallel task is compiled and linked, a task ex- 
ecution tree file a.tree is created. PS uses this tree to 
perform the parallel thread scheduling. When a thread 
is to be created, the root process will transfer execution 
control to PS. The latter will use the a.tree file to schedule it
on a local or remote workstation, and then transfer ex- 
ecution control back to the application. Similarly, the ex- 
ecution control will be transferred to PS when a thread 
terminates itself by joining other threads. Figure 5 shows 
the overall interaction between DRS and PS. The paral- 
lel compiler and linker create the image of virtual 
memory segments and the task execution tree. PS is in- 
voked when a thread needs to fork or join with other 
threads. 

Furthermore, the root thread of DRS is responsible 
for creating PS. The alias thread on each remote work- 
station listens to remote thread creation requests sent by 
PS, and creates threads locally. 

Similarly, the active entity of the DSM subsystem, the
DSM thread, is created by the root thread or alias threads
on different workstations. At the same time, the data
structures needed by the DSM thread are also created and
initialized.

The efficiency of handling shared memory by the DSM
subsystem is significantly affected by the layout of shared
variables on DSM memory segments and the allocation
of physical memory on different workstations by the
parallel programming subsystem and parallel scheduler.
Figure 6 shows an example of the run time behavior of
the DSM subsystem.

In Figure 6, the virtual address space of the parallel
task is on the left side. Each shadowed paragraph within
the virtual address space represents a single virtual
memory segment. The physical address spaces on different
hosts are on the right side. The shadowed paragraph
within a host denotes a block of physical memory, and
the other structure represents the segment map table.
The paragraphs with arrowheads represent the
corresponding mappings between the memory segments and
the physical memory blocks on different hosts.

8. Performance and Discussion 

The performance of the DICE DSM system has been studied
using an analytical model, which derives an expression
for the speedup of the parallel part of an application
(S_p). The effects of system structure and application
behavior on S_p are shown and discussed in [15].
Some of these results are shown in this section. The
system and application parameters used in this model are
summarized in Table 1 in the Appendix.

A high-speed, low-latency ATM LAN is assumed in
this model. We also assume that queuing time on the
network is negligible. This assumption is justified by the
results shown in Figure 7 (Appendix), which indicate
that the gain in S_p becomes smaller and smaller as the
network data rate R_n is increased. Figure 8 (Appendix)
shows that S_p decreases as processor speed R_p
increases. Note that the total execution time for an
application will still be reduced as R_p increases, although
S_p decreases.
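
The last observation can be made concrete with a deliberately simplified toy model. This is not the analytical model of [15]: it assumes that computation divides evenly across the hosts and that communication adds a fixed transfer cost independent of processor speed. All names and numeric values below are illustrative.

```python
def parallel_time(work, n_hosts, r_p, comm_bytes, r_n):
    """Toy model: computation split evenly across hosts plus a fixed transfer cost."""
    return work / (n_hosts * r_p) + comm_bytes / r_n


def speedup(work, n_hosts, r_p, comm_bytes, r_n):
    """Single-host time divided by the toy parallel time."""
    return (work / r_p) / parallel_time(work, n_hosts, r_p, comm_bytes, r_n)


# Same workload and network; only the processor speed differs by a factor of 10.
slow = dict(work=1e9, n_hosts=8, r_p=1e7, comm_bytes=1e7, r_n=1e7)
fast = dict(slow, r_p=1e8)
```

With these values the parallel time drops from 13.5 to 2.25 time units as the processor becomes faster, while the speedup falls from about 7.4 to about 4.4, because the communication term does not shrink with processor speed and so makes up a larger share of the total.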

Figure 9 (Appendix) shows that S_p increases as the
number of paragraphs per page, k, increases up to a
certain point. After that point, S_p decreases slightly as k
increases further. Furthermore, S_p is approximately
the same for a fixed paragraph size, which is P/k. This
behavior demonstrates the usefulness of a paragraph with
a smaller granularity than a page. Figure 10
(Appendix) shows a similar behavior for S_p in relation
to the number of hosts N.

9. Conclusions 

In this paper, we present the architecture of a distributed
computing environment, DICE, which integrates
distributed shared memory with parallel scheduling and
distributed run-time management. The analysis of the
performance model demonstrates the usefulness of a
paragraph with a smaller granularity than a page in
the DICE system. This smaller granularity reduces the
chance of false sharing and the amount of data that needs
to be transferred over the network.
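
The effect of granularity on false sharing can be illustrated with a small address calculation. The page size of 4 kbytes matches the parameter settings in the Appendix; the value k = 8, the function same_unit, and the two sample addresses are assumptions chosen for illustration.

```python
def same_unit(addr_a, addr_b, unit_size):
    """True if both byte addresses fall in the same coherence unit."""
    return addr_a // unit_size == addr_b // unit_size


PAGE_SIZE = 4096                    # P
K = 8                               # paragraphs per page (illustrative value)
PARAGRAPH_SIZE = PAGE_SIZE // K     # P/k = 512 bytes

# Two unrelated variables, 1024 bytes apart, written by different hosts.
x_addr, y_addr = 0x1000, 0x1400
```

At page granularity the two variables share a coherence unit, so a write to one invalidates the other host's copy of both (false sharing). At paragraph granularity they fall in different units, and each transfer moves 512 bytes instead of 4096.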

The coherency protocol for this two-tier paging sys- 
tem is also being simulated in software. The perform- 
ance of DICE DSM is also being evaluated using a simu- 
lation model, which takes into consideration network 
queuing delay. The Object-Oriented Dataflow Lan- 
guage and self-tuning Parallel Scheduler are under de- 
velopment. 

The current DICE DSM design is based on the strict
consistency model and write-invalidate coherency
protocol. This design is intended to be extended by using
multiple consistency and coherency protocols. Multiple
protocols will be used to meet a broader range of
application requirements. DICE will incorporate a DSM design
with a relaxed consistency model to hide the large latency
of remote memory accesses by allowing buffering and
merging.

Appendix


Parameter    Meaning

N            the number of hosts executing an application
R_n          network data rate
R_p          processor speed
M            the total bytes of shared memory space for the running application
P            page size
k            the number of paragraphs per page
d            the percentage of data memory accesses for total instructions
Nrf          the number of read faults per 1,000,000 memory references per host
Nwf          the number of write faults per 1,000,000 memory references per host
Xs           spatial locality factor
N0, Nl, g    temporal locality factors

Table 1. System and application parameters in the performance model.



[Figures 7-10: plots of S_p not reproduced. Parameter settings from the captions:]
M=64kbytes, P=4kbytes, d=0.4, Nrf=500, Nwf=10, Xs=0.5, N0=10, Nl=100, g=100. (two figures)
Nwf=10, Xs=0.5, N0=10, Nl=100, g=100.
Nrf=500, Nwf=10, N0=10, Nl=100, g=100.